rolling functions #9

jangorecki · 2023-04-24T19:09:07Z

draft version for rolling functions requested in #6

we can define scope for those tests here. Some are already there, others only mentioned in comments

PR roadmap:

optionally: test scripts and validate results:

juliadf
polars - @ritchie46 any chance for the script?
clickhouse - https://clickhouse.com/docs/en/sql-reference/window-functions
datafusion - @Dandandan https://arrow.apache.org/datafusion/user-guide/sql/window_functions.html
dask - https://docs.dask.org/en/stable/dataframe-api.html#rolling-operations

rolling functions not available in:

pydatatable - Rolling aggregate support based on windows within a DT h2oai/datatable#1500
juliads
arrow - support for arrow data frame r-lib/slider#195

jangorecki · 2023-06-21T15:33:47Z

As there was no feedback on the scope I proposed in April, I took another step forward and defined it more precisely. Please have a look at the proposed questions. It is not easy to cover all the possible features well within 10 questions, so I ended up focusing on:

funs: mean, median (moving holistic aggregate), regression, UDF
scalability: window small or big (length of the input is not part of the questions, it is a parameter of the script, so this scalability is always tested)
features: vectorized input, weighted fun, uneven time series, by.column=F

q1: rolling mean
q2: window small
q3: window big
q4: multi vars+cols
q5: median
q6: weighted
q7: uneven dense
q8: uneven sparse
q9: regression (by.column)
q10: UDF (one that will generally not be optimized, to mimic arbitrary UDF)

Looking forward to feedback on the scope for that test, or implementations in other software.
Note that we may be forced to move from N {1e6, 1e7, 1e8} to N {1e5, 1e6, 1e7} depending on how well other tools will scale.
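
For readers unfamiliar with the baseline question (q1), here is a minimal pure-Python sketch of a right-aligned rolling mean that keeps a running sum. The function name `rolling_mean` and the choice to return `None` for incomplete windows are assumptions made for illustration; they are not taken from the benchmark scripts.

```python
from typing import List, Optional


def rolling_mean(x: List[float], w: int) -> List[Optional[float]]:
    """Right-aligned rolling mean over a fixed window of w rows.

    Positions that do not yet have w observations yield None
    (an assumption of this sketch, mirroring an NA fill).
    """
    out: List[Optional[float]] = []
    running = 0.0
    for i, v in enumerate(x):
        running += v                 # add the value entering the window
        if i >= w:
            running -= x[i - w]      # drop the value leaving the window
        out.append(running / w if i >= w - 1 else None)
    return out


result = rolling_mean([1.0, 2.0, 3.0, 4.0, 5.0], 3)
```

Each step does a constant amount of work, so the cost does not grow with the window size, which is why q2 and q3 (small and big windows) are informative mainly for implementations that recompute each window from scratch.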

jangorecki · 2023-06-24T17:19:28Z

I pushed implementation for dplyr (using slider) as well. So we have at the moment data.table and dplyr.
All the other scripts and control csv files are amended for the new task, so it can now be run just by calling ./run.sh.
I commented out q10 "UDF" for now. It is not at all comparable with the other questions, being 1,000-10,000 times slower than the rest. We either have to figure out some very lightweight UDF, or just drop UDF from the scope.

jangorecki · 2023-06-25T19:30:44Z

Instead of q10 UDF I would propose either:

rolling min - its implementation is still quite different from mean/sum
second holistic aggregate - but here I don't see much added value. Looking at https://duckdb.org/2021/11/12/moving-holistic.html: mode is not useful for our float64 measure variable (we could add a low/medium cardinality int variable just for a mode question), median is already in this benchmark, and quantile is not much different from median, since median is just a special case of quantile.
??

From two options above I am in favor of min.
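
To illustrate why rolling min is a different implementation from mean/sum: a running sum can subtract the value that leaves the window, but a minimum cannot be "un-taken". One common technique is a monotonic deque of candidate indices. The sketch below is a generic illustration of that technique, not the algorithm used by any of the benchmarked tools; `rolling_min` is a hypothetical name.

```python
from collections import deque
from typing import Deque, List, Optional


def rolling_min(x: List[float], w: int) -> List[Optional[float]]:
    """Right-aligned rolling minimum using a monotonic deque of indices."""
    out: List[Optional[float]] = []
    candidates: Deque[int] = deque()  # indices whose values are increasing
    for i, v in enumerate(x):
        # Values not smaller than the new one can never be the minimum again.
        while candidates and x[candidates[-1]] >= v:
            candidates.pop()
        candidates.append(i)
        # Drop the front index once it falls out of the window.
        if candidates[0] <= i - w:
            candidates.popleft()
        out.append(x[candidates[0]] if i >= w - 1 else None)
    return out


mins = rolling_min([3.0, 1.0, 4.0, 1.5, 5.0, 9.0], 3)
```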

Posting here before doing the change as I hope there may be some other ideas.

edit: amended in 045b7d5

jangorecki · 2023-06-25T19:41:33Z

Other topics that are subject to community review are:

data sizes, either

1e5, 1e6, 1e7 (could be desired to run slower tools)
1e6, 1e7, 1e8 (currently implemented, feels fine)
1e7, 1e8, 1e9 (if we drop UDF in q10 this should be do-able)
1e5, 1e7, 1e9 (mix of all)

window size

w = nrow(x)/1e3       ## used in 8 out of 10 questions
wsmall = nrow(x)/1e4  ## used in q2
wbig = nrow(x)/1e2    ## used in q3

In case of 1e9 data size, window size would be 1e6, which feels unrealistically big. I feel we could improve window sizes.
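
The window arithmetic above can be tabulated for each candidate data size. This is only a restatement of the three formulas in Python; `window_sizes` is a hypothetical helper.

```python
from typing import Dict


def window_sizes(n: int) -> Dict[str, int]:
    """Window sizes derived from the number of rows, as in the formulas above."""
    return {
        "w": n // 1000,        # nrow(x)/1e3
        "wsmall": n // 10000,  # nrow(x)/1e4
        "wbig": n // 100,      # nrow(x)/1e2
    }


sizes = {n: window_sizes(n) for n in (10**5, 10**6, 10**7, 10**8, 10**9)}
for n, s in sizes.items():
    print(n, s)
```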

unevenly ordered series (q7, q8 or in recent HEAD q8, q9)

DT[["id2"]] = sort(sample(N*1.1, N))         ## index dense
DT[["id3"]] = sort(sample(N*2, N))           ## index sparse

dense index is 110% range of nrow.
sparse index is 200% range of nrow.
I am not sure if we are stressing the sparse scenario well enough. It could even be 1000% range of nrow. Then the problem is that we cannot easily use 1e9 data size, because the index would be in the range 1 to 1e10 and many tools would be excluded (maybe it's fine?). Using 200% range of nrow still fits into the int32 type.
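
A small sketch of the index generation and the int32 concern, written in Python as an analogue of `sort(sample(N*ratio, N))`. The helper names, the seed, and the small `n` used for the demonstration are assumptions of this sketch.

```python
import random
from typing import List

INT32_MAX = 2**31 - 1


def uneven_index(n: int, ratio: float, seed: int = 0) -> List[int]:
    """Sorted sample of n distinct integers drawn from 1..n*ratio."""
    rng = random.Random(seed)
    upper = int(n * ratio)
    return sorted(rng.sample(range(1, upper + 1), n))


def fits_int32(n: int, ratio: float) -> bool:
    """True if the largest possible index value fits in a signed 32-bit int."""
    return int(n * ratio) <= INT32_MAX


dense = uneven_index(1000, 1.1)   # 110% range of nrow
sparse = uneven_index(1000, 2)    # 200% range of nrow

print(fits_int32(10**9, 2))    # 200% range at 1e9 rows
print(fits_int32(10**9, 10))   # 1000% range at 1e9 rows
```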

AdrianAntico · 2023-06-25T22:55:40Z

@jangorecki you have a solid set of measures, the only other type I'd consider would be differencing

era127 · 2023-06-26T22:03:38Z

I think it would be helpful if the dplyr test was documented as a representation of the slider package, as opposed to using RcppRoll. The dplyr benchmark has been confusing to me because you can use dplyr with duckdb or data.table. For data.table, it is more obvious that you are benchmarking the rolling functions inside the data.table package.

jangorecki · 2023-06-27T19:02:49Z

@AdrianAntico that is an interesting idea, but from a brief look at a potential implementation (use of diff in R), it doesn't seem to stress windowing computation. If you could provide a code example, it would be clearer.

@rdavis120 maintaining new solutions has a relatively high cost, and putting slider under dplyr was an easy way to avoid that. Rather than adding a new solution slider, I would prefer to rename dplyr to tidyverse, so it will fit well and there will be no need for adding another solution. Anyway, I would prefer to keep this discussion focused on the rolling task scope rather than naming details.

AdrianAntico · 2023-06-27T20:01:26Z

@jangorecki the intent was more time series related (and cross-row related). I have a diff function in my github package "Rodeo" if you want a full example, starting at line 443... https://github.com/AdrianAntico/Rodeo/blob/main/R/FeatureEngineering_CrossRowOperations.R
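
For context, lagged differencing in its simplest form looks like the sketch below. It is a generic illustration and does not reproduce the linked Rodeo function; `lag_diff` is a hypothetical name.

```python
from typing import List, Optional


def lag_diff(x: List[float], lag: int = 1) -> List[Optional[float]]:
    """Difference between each value and the value `lag` rows earlier."""
    return [None if i < lag else x[i] - x[i - lag] for i in range(len(x))]


d1 = lag_diff([1.0, 4.0, 9.0, 16.0])
d2 = lag_diff([1.0, 4.0, 9.0, 16.0], lag=2)
```

Each output depends on exactly two input rows regardless of the lag, which is consistent with the remark that differencing does not stress windowing computation.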

jangorecki · 2023-07-18T08:28:38Z

@Tmonster could we get CI workflow approval?

jangorecki · 2023-07-20T19:52:43Z

@Tmonster Is there a way that I could approve CI runs? Does it run on duckdblabs private runners? If not, then I don't think there should be any concerns.

I added pandas rollfun script, and (hopefully) fixed failure in previous GH Actions job.

bkamins · 2023-07-27T12:19:49Z

@jangorecki - which is the reference implementation of all the questions from Q1 to Q10 (I am asking, because I have checked several solutions and they indicate "not implemented yet" and I do not know how to exactly reproduce them in Julia)

In particular I am not clear what frollreg(list(x$v1, x$v2), w) should mean in data.table. You input two vectors and want to do a regression on them, but it seems strange for two reasons:

you regress vector on a vector (typically features would be a matrix);
you do not include intercept, so do you want a model v1 ~ 0 + v2 in standard notation and want to return only the coefficient estimated for v2?
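
To make the second reading concrete: for a model v1 ~ 0 + v2, the single coefficient in each window is sum(v1*v2) / sum(v2*v2). The sketch below computes it with running sums. It illustrates that interpretation only; it is not the benchmark's definition, and the function name is hypothetical.

```python
from typing import List, Optional


def rolling_slope_no_intercept(
    y: List[float], x: List[float], w: int
) -> List[Optional[float]]:
    """Rolling coefficient of y ~ 0 + x over a fixed window of w rows."""
    out: List[Optional[float]] = []
    sxy = 0.0  # running sum of x*y within the window
    sxx = 0.0  # running sum of x*x within the window
    for i in range(len(y)):
        sxy += x[i] * y[i]
        sxx += x[i] * x[i]
        if i >= w:
            sxy -= x[i - w] * y[i - w]
            sxx -= x[i - w] * x[i - w]
        complete = i >= w - 1
        out.append(sxy / sxx if complete and sxx != 0 else None)
    return out


slopes = rolling_slope_no_intercept([2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0], 2)
```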

Also I noticed the comment:

## Killed, UDF simply does not scale, needs to be specialized fun

So the question is if you allow specialized functions or the opposite - you do not allow them and accept that the process does not produce the result (this was the approach in your earlier benchmarks - you wanted to check what is the performance of "out of the box" solutions without writing custom code tuned for performance).

jangorecki · 2023-07-27T16:26:20Z

@bkamins rolling regression is available only for duckdb for now. frollreg will most likely never exist, as there are already nice implementations of rolling regression in other R packages. As for the design of q10, I would lean toward finding the most popular rolling regression question on stackoverflow and aligning to it. It is quite frequently requested functionality, which is why I decided to have it in the scope of this task.

Doing rolling regression with frollapply(lm, by.column=F) (or UDF in any other solution) is possible but will be 100-1000 times slower than a specialized version, therefore IMO it doesn't make sense to include rolling regression via generic UDF interfaces.
So for rolling regression we want only specialized funs. An (unoptimized) UDF question (initially proposed) went out of scope due to terrible scaling.

bkamins · 2023-07-27T18:07:36Z

But then - back to my original question => which is the reference implementation of the questions I should match the results against in Julia implementation? (in particular - do you want to include constant term in the regression and what should be returned from the operation)

Thank you!

jangorecki · 2023-07-27T18:48:53Z

That hasn't been settled yet. I need to go through stackoverflow to find common questions about it. Actually, it would be helpful if you could propose one that you believe is the common problem users are looking to solve with rolling regression.

bkamins · 2023-07-27T20:26:32Z

Incidentally, just today I saw an example from a user. The user wanted to run a y ~ x kind of regression and keep the result as a vector of collections, something like:

3-element Vector{Vector{Float64}}:
 [0.11728235062958436, 0.9228342578148421]
 [0.2160138268973083, 0.41776928538024183]
 [0.42587771406039454, 0.10348333203334836]

(so both intercept and slope are kept in a single entry)
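
A sketch of that output shape in Python: a simple y ~ x least-squares fit per complete window, returning one [intercept, slope] pair per window. It recomputes each window from scratch for clarity rather than speed, and it assumes x is not constant within a window. The names `ols_pair` and `rolling_ols` are hypothetical.

```python
from typing import List


def ols_pair(xs: List[float], ys: List[float]) -> List[float]:
    """Return [intercept, slope] of the least-squares line ys ~ xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)  # assumed non-zero
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    slope = sxy / sxx
    return [mean_y - slope * mean_x, slope]


def rolling_ols(x: List[float], y: List[float], w: int) -> List[List[float]]:
    """One [intercept, slope] pair per complete window of w rows."""
    return [
        ols_pair(x[i - w + 1 : i + 1], y[i - w + 1 : i + 1])
        for i in range(w - 1, len(x))
    ]


pairs = rolling_ols([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], 3)
```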

jangorecki · 2023-07-28T16:34:58Z

What I did for duckdb q10 for now is r^2, because this is what we used in groupby q9
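
For a single predictor, r^2 equals the squared Pearson correlation, so a rolling r^2 can be sketched as below. This is an illustration of the quantity, not the duckdb query used in the branch; it recomputes each window from scratch and returns `None` for incomplete or degenerate windows, both of which are assumptions of this sketch.

```python
from typing import List, Optional


def r_squared(xs: List[float], ys: List[float]) -> Optional[float]:
    """Squared Pearson correlation of xs and ys, or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None  # constant input: correlation is undefined
    return (sxy * sxy) / (sxx * syy)


def rolling_r2(x: List[float], y: List[float], w: int) -> List[Optional[float]]:
    """Right-aligned rolling r^2; incomplete windows yield None."""
    return [
        r_squared(x[i - w + 1 : i + 1], y[i - w + 1 : i + 1]) if i >= w - 1 else None
        for i in range(len(x))
    ]


r2 = rolling_r2([1.0, 2.0, 3.0], [1.0, 2.0, 2.0], 3)
```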

jangorecki · 2023-07-31T14:11:20Z

duckdb, spark, and pandas (q8, q9 only) do not have an option for properly handling an incomplete rolling window. Timings for those solutions will not include the postprocessing required to match exactly the same result (NULL vs. a value from an unexpected window size), as the overhead would be too big.
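
The difference between the two behaviours can be shown with a small sketch: one function averages whatever rows are available in an incomplete window, and a separate postprocessing step masks those positions. Both functions are hypothetical illustrations and do not reproduce any of the solution scripts.

```python
from typing import List, Optional


def rolling_mean_partial(x: List[float], w: int) -> List[float]:
    """Rolling mean that also evaluates incomplete leading windows."""
    out: List[float] = []
    for i in range(len(x)):
        window = x[max(0, i - w + 1) : i + 1]  # may hold fewer than w rows
        out.append(sum(window) / len(window))
    return out


def mask_incomplete(values: List[float], w: int) -> List[Optional[float]]:
    """Postprocessing step: blank out results from windows shorter than w."""
    return [None if i < w - 1 else v for i, v in enumerate(values)]


partial = rolling_mean_partial([1.0, 2.0, 3.0, 4.0], 3)
masked = mask_incomplete(partial, 3)
```

The masking pass is the extra work that, per the comment above, is left out of the timings.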

jangorecki · 2023-07-31T21:45:14Z

Development is possibly finished on this branch. 5 solutions added till now have been validated using https://github.com/jangorecki/db-benchmark/blob/rollfun/_utils/rollfun-ans-validation.txt

@bkamins if you would like to add Julia, you are welcome, please use commands from file linked above to validate answers against one of the solutions.

Once I confirm the report is produced correctly (after running the whole rollfun bench), the PR will be ready to merge.

chksum matches exactly, last val diff 1e6 q8 is 0.00000000000004001796 previously discussed in h2oai's h2oai#95 and h2oai#136 and addressed in 7eacaa1
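
A difference of that magnitude is typical floating-point roundoff, which is why answers are usually compared with a tolerance rather than for exact equality. The sketch below shows the idea; the tolerance value and the function name are assumptions and are not taken from the validation file linked above.

```python
import math


def answers_match(a: float, b: float, rel_tol: float = 1e-9) -> bool:
    """Compare two answers allowing for a small relative roundoff difference."""
    return math.isclose(a, b, rel_tol=rel_tol)


reference = 0.5
other = reference + 0.00000000000004001796  # diff of the size quoted above

exact_equal = reference == other
close_enough = answers_match(reference, other)
```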

jangorecki · 2023-08-12T12:45:02Z

@Tmonster PR is ready to merge

To reproduce

# install R and python

git clone https://github.com/jangorecki/db-benchmark --branch rollfun --single-branch --depth 1
cd db-benchmark
# install solutions interactively
./dplyr/setup-dplyr.sh
./datatable/setup-datatable.sh
./pandas/setup-pandas.sh
./duckdb-latest/setup-duckdb-latest.sh
./spark/setup-spark.sh

# prepare data
Rscript _data/rollfun-datagen.R 1e6 NA 0 1
Rscript _data/rollfun-datagen.R 1e7 NA 0 1
Rscript _data/rollfun-datagen.R 1e8 NA 0 1
mkdir data
mv R1_1e*_NA_0_1.csv data

vim run.conf
# do_upgrade false
# force run true
# report false
# publish false
# run_task rollfun
# run_solution data.table dplyr pandas duckdb-latest spark

sudo swapoff -a

# workaround for #30: `=~` matching in run.sh causes "duckdb-latest" to match "duckdb"
vim run.sh
# comment out line 76 in run.sh: if [[ "$RUN_SOLUTIONS" =~ "duckdb" ]]; then ./duckdb/ver-duckdb.sh; fi;

./run.sh > ./run.out
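
The workaround above is needed because a substring test on the solutions list cannot tell "duckdb" from "duckdb-latest". The same pitfall, and a token-based alternative, can be sketched in Python; the variable names are illustrative and this is not a patch for run.sh.

```python
# Space-separated list of solutions, as in run.conf above.
run_solutions = "data.table dplyr pandas duckdb-latest spark"

# Substring test: "duckdb" is found inside "duckdb-latest".
substring_hit = "duckdb" in run_solutions

# Token test: compare against whole entries only.
token_hit = "duckdb" in run_solutions.split()

print(substring_hit, token_hit)
```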

jangorecki · 2023-09-18T08:24:42Z

@Tmonster any idea if PR can make it to master before scheduled September's run?

rolling functions

5eb8512

Tmonster mentioned this pull request May 8, 2023

Add additional data wrangling methods #6

Open

jangorecki added 2 commits June 17, 2023 16:36

Merge branch 'master' into rollfun

16f6d5d

rollfun task scope polishing, DT implementation

67b7fbf

comments in datagen

554eedc

jangorecki marked this pull request as ready for review June 22, 2023 10:51

jangorecki added 2 commits June 22, 2023 13:14

define udf in q10

ff52fd8

rollfun set up, dt and dplyr for now

7e265ef

dt to dplyr validation

29ada82

jangorecki mentioned this pull request Jun 25, 2023

new task: timeseries h2oai/db-benchmark#139

Open

jangorecki added 6 commits July 2, 2023 18:51

Merge branch 'master' into rollfun

0df2fbd

rollfun CI

b6dd51e

rollfun questions amended

045b7d5

comments and readme

ec3ebc0

Merge branch 'master' into rollfun

6366a17

run conf back to master

96a2eb0

jangorecki added 4 commits July 20, 2023 21:15

pandas rollfun

7ef67c2

done

572085b

enable pandas rollfun config

57e04b2

use standard R repo for GH Actions

c82a962

jangorecki and others added 2 commits July 25, 2023 06:15

spark rollfun disable median

e064aab

spark rollfun fixes

fac19b0

bkamins mentioned this pull request Jul 25, 2023

Some comments regarding documentation JeffreySarnoff/WindowedFunctions.jl#1

Open

jangorecki and others added 5 commits July 29, 2023 23:02

duckdb rollfun workaround for partial window

657cee0

q10 update r2 rather than v1, duckdb rollfun

3b60271

q6 update colnames, duckdb rollfun

af5991f

history report, rollfun

3eb8394

spark rollfun workaround for partial window

241f593

jangorecki added 2 commits July 31, 2023 23:25

rollfun validation post fixes

05e1660

dt rollfun q8 q9 needs exact

edea136

jangorecki and others added 7 commits August 4, 2023 10:36

DT q8 q9 uses algo=fast again as roundoff is way less than tolerance

3c114e5

chksum matches exactly, last val diff 1e6 q8 is 0.00000000000004001796 previously discussed in h2oai's h2oai#95 and h2oai#136 and addressed in 7eacaa1

readme rollfun

9b7e57c

cleanup dev code

076b573

report for rollfun

546eb28

workaround for using branch rather than master

5bb0922

more pkgs required for dplyr script

b104f3a

rollfun timeout exceptions after aws run

1fa7420

improve how pretty handles edge cases, seen in R1_1e6_NA_0_1 advanced

03dac72

jangorecki mentioned this pull request Sep 6, 2023

adapt uneven time series rolling window Rdatatable/data.table#5576

Open

update DT for added rollmin and rollmedian

e07185e
