Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Latest UDAF. #4395

Closed
big-andy-coates opened this issue Jan 28, 2020 · 8 comments
Closed

Latest UDAF. #4395

big-andy-coates opened this issue Jan 28, 2020 · 8 comments
Projects

Comments

@big-andy-coates
Copy link
Contributor

big-andy-coates commented Jan 28, 2020

Many users are requesting some variant of a LATEST udaf.

For example, consider a stream of IOT data:

{"sensorId": 1224, "temp": 56.4, "quality": 3},
{"sensorId": 2658, "temp": 56.4, "quality": 3},
{"sensorId": 1224, "temp": 96.4, "quality": 1}

Users would like to be able to build a table from this data, capturing the latest temp and quality for a given sensor.

Such an operation should be possible if we add a LATEST UDAF. Note, the UDAF will only work on stream sources. It's not possible to support LATEST on table sources due to the fact that tables can have values retracted.

The UDAF would take two parameters:

  • The field storing the timestamp of the row, (e.g. ROWTIME, but potentially something else)
  • The value to store.

The UDAF will only update the value if the timestamp of the new value is greater.

Table of data can be built with something like:

CREATE TABLE AS 
   SELECT sensorId, LATEST(temp, ROWTIME), LATEST(quality, ROWTIME)
   FROM DATA_STREAM
   GROUP BY sensorId;
@big-andy-coates
Copy link
Contributor Author

To support this, we'll need to extend UDAFs to support more parameters: which is a long overdue feature in itself.

@big-andy-coates big-andy-coates added this to To do in UDFs via automation Jan 28, 2020
@big-andy-coates
Copy link
Contributor Author

Related / duplicates:
#3985
#1128

@big-andy-coates
Copy link
Contributor Author

Might be best called ARGMAX UDAF, and matching ARGMIN.

@miguno
Copy link
Contributor

miguno commented Feb 4, 2020

Might be best called ARGMAX UDAF, and matching ARGMIN.

One question is: Should LATEST (or whatever its name would be) return only a specific data field vs. the full row? The little prior art I found pick the latter (fwiw, no such support is in the SQL standard from what I can tell):

  • In math, argmax returns the x value(s) for which f(x) is the maximum of the function f. (The x values would mean the full row, not just one data field.)
  • I came across arg_max() in Azure Data Explorer, which finds the row where an expression is maximized (for latest, we could argue that the value of ROWTIME should be maximized).
  • Otherwise, people seem to use workarounds to express this in SQL, see e.g. Get row with highest or lowest value from a GROUP BY.

@agavra
Copy link
Contributor

agavra commented Feb 4, 2020

@miguno - this was my suggestion. I thought I had found some precedent for that somewhere... I'll let you know if I find that

@PeterLindner
Copy link

PeterLindner commented Feb 4, 2020

@miguno I guess I don't get your point.

One question is: Should LATEST (or whatever its name would be) return only a specific data field vs. the full row?

According to the function signature Andy implied above it has to be only a specific data field, otherwise LATEST needs to live in the FROM clause (ie something like SELECT * FROM LATEST(stream)) but I'm not sure what that should do and whether that is different from just reading the stream

In math, argmax returns the x value(s) for which f(x) is the maximum of the function f. (The x values would mean the full row, not just one data field.)

I don't agree with your conclusion. If x is an integer, ARGMAX would return an integer according to the definition you gave, otherwise x would have to be some kind of struct. Caution has to be taken when x is ambiguous though, then some tie-breaking has to be done (or maybe a map could be returned but that probably won't be a good idea)

I came across arg_max() in Azure Data Explorer, which finds the row where an expression is maximized (for latest, we could argue that the value of ROWTIME should be maximized).

LATEST, like Andy described it, is a subset of ARGMAX so why not implement the more generic one when the algorithm is essentially the same? ARGMIN and ARGMAX are actually quite powerful and I'm sure I'll need them analyzing our IIoT data (eg. finding position of tooling at maximum pressure during each production cycle)

Otherwise, people seem to use workarounds to express this in SQL, see e.g. Get row with highest or lowest value from a GROUP BY.

the proper solution for the linked problem using ARGMIN (if it existed) would be:

SELECT 
  ARGMIN(value, id) AS id,
  name,
  MIN(value) AS value
FROM test
GROUP BY name;

but I think if you want the whole row back, then something along those lines would be more useful:

SELECT
  *
FROM test
GROUP BY name
HAVING value = MIN(value)

(I'm not a SQL guy, so I'm not sure if the last query actually works, but I think it should work 😉; also again not sure what would/should happen if value is ambiguous)

@big-andy-coates
Copy link
Contributor Author

Closing as a duplicate of #3985

UDFs automation moved this from To do to Done Feb 20, 2020
@purplefox
Copy link
Contributor

One question is: Should LATEST (or whatever its name would be) return only a specific data field vs. the full row?

I think it should only return a single field, as you might want to combine it with other aggregate functions in the same query, e.g.

select sensor_id, latest(temperature), max(temperature), min(temperature)
from sensor_readings
group by sensor_id

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
UDFs
  
Done
Development

No branches or pull requests

5 participants