rolling functions #9
As there was no feedback on the scope I proposed in April, I took another step forward and defined it more precisely. Please have a look at the proposed questions. It is not easy to cover all possible features well within 10 questions, so I ended up focusing on:
Looking forward to feedback on the scope for that test, or implementations in other software. |
|
Instead of q10 UDF I would propose either:
From two options above I am in favor of Posting here before doing the change as I hope there may be some other ideas. edit: amended in 045b7d5 |
Other topics that are subject to community review are:
w = nrow(x)/1e3 ## used in 8 out of 10 questions
wsmall = nrow(x)/1e4 ## used q2
wbig = nrow(x)/1e2 ## used q3
In case of 1e9 data size, window size would be 1e6, which feels unrealistically big. I feel we could improve window sizes.
DT[["id2"]] = sort(sample(N*1.1, N)) ## index dense
DT[["id3"]] = sort(sample(N*2, N)) ## index sparse
The dense index spans a range of 110% of nrow. |
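To make the numbers under review easier to eyeball, here is a minimal Python sketch of the window sizes and the two index densities quoted above. The function names `window_sizes` and `make_index` and the `seed` parameter are hypothetical and are not taken from the benchmark scripts. The sketch assumes integer division is an acceptable stand-in for `nrow(x)/1e3`.

```python
import random

def window_sizes(nrow):
    """Window sizes derived from the row count, as listed in the thread."""
    return {
        "w": nrow // 1000,        # nrow/1e3, used in 8 out of 10 questions
        "wsmall": nrow // 10000,  # nrow/1e4, used in q2
        "wbig": nrow // 100,      # nrow/1e2, used in q3
    }

def make_index(n, range_factor, seed=0):
    """Draw n distinct integers from 1..int(n * range_factor) and sort them.

    Mirrors sort(sample(N * factor, N)): a factor of 1.1 gives the dense
    index, a factor of 2 gives the sparse one. `seed` is an assumption
    added here only to make the sketch reproducible.
    """
    rng = random.Random(seed)
    upper = int(n * range_factor)
    return sorted(rng.sample(range(1, upper + 1), n))

# Show how the window grows with data size (1e9 rows gives w = 1e6).
for nrow in (10**6, 10**7, 10**8, 10**9):
    print(nrow, window_sizes(nrow))

dense = make_index(1000, 1.1)   # about 9% of the range is left as gaps
sparse = make_index(1000, 2)    # about half of the range is left as gaps
```

Printing the sizes side by side may help when judging whether a fixed fraction of nrow is the right rule for the largest data size.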
@jangorecki you have a solid set of measures, the only other type I'd consider would be differencing |
I think it would be helpful if the dplyr test was documented as a representation of the slider package, as opposed to using RcppRoll. The dplyr benchmark has been confusing to me because you can use dplyr with duckdb or data.table. For data.table, it is more obvious that you are benchmarking the rolling functions inside the data.table package. |
@AdrianAntico that is an interesting idea, but from a brief look at a potential implementation, it doesn't seem to stress windowing computation (use of
@rdavis120 maintaining new solutions has a relatively high cost, and putting slider under dplyr was an easy way to avoid that. Rather than adding slider as a new solution, I would prefer to rename dplyr to tidyverse, so it fits well and there is no need to add another solution. Anyway, I would prefer to keep this discussion focused on the scope of the rolling task rather than on naming details. |
@jangorecki the intent was more time series related (and cross-row related). I have a diff function in my github package "Rodeo" if you want a full example, starting at line 443... https://github.com/AdrianAntico/Rodeo/blob/main/R/FeatureEngineering_CrossRowOperations.R |
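For readers following the differencing discussion, a rough Python sketch of the contrast being drawn: a lagged difference reads only two rows per output, while a rolling aggregate has to account for every row in the window. The names `lag_diff` and `rolling_sum` are hypothetical and do not come from Rodeo or from any benchmarked solution.

```python
def lag_diff(x, lag=1):
    """x[i] - x[i - lag]; None where the lagged row does not exist."""
    return [None if i < lag else x[i] - x[i - lag] for i in range(len(x))]

def rolling_sum(x, w):
    """Sum over the last w values; None until the window is complete."""
    out = []
    acc = 0
    for i, v in enumerate(x):
        acc += v              # add the row entering the window
        if i >= w:
            acc -= x[i - w]   # drop the row leaving the window
        out.append(acc if i >= w - 1 else None)
    return out

x = [1, 4, 9, 16, 25]
print(lag_diff(x))        # [None, 3, 5, 7, 9]
print(rolling_sum(x, 3))  # [None, None, 14, 29, 50]
```

Under this reading, differencing is closer to a shift operation than to a windowed computation, which may be why it adds little to a rolling-functions task.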
@Tmonster could we get CI workflow approval? |
@Tmonster Is there a way I could approve CI runs myself? Do they run on duckdblabs private runners? If not, I don't think there should be any concerns. I added a pandas rollfun script and (hopefully) fixed the failure in the previous GH Actions job. |
@jangorecki - which is the reference implementation of all the questions from Q1 to Q10 (I am asking, because I have checked several solutions and they indicate "not implemented yet" and I do not know how to exactly reproduce them in Julia) In particular I am not clear what
Also I noticed the comment:
So the question is if you allow specialized functions or the opposite - you do not allow them and accept that the process does not produce the result (this was the approach in your earlier benchmarks - you wanted to check what is the performance of "out of the box" solutions without writing custom code tuned for performance). |
@bkamins rolling regression is currently available only for duckdb. frollreg will most likely never exist, as there are already nice implementations of rolling regression in other R packages. As for the design of q10, I would lean toward finding the most popular rolling regression question on stackoverflow and aligning to it. It is quite frequently requested functionality, which is why I decided to have it in scope of this task. Doing rolling regression with |
But then - back to my original question => which is the reference implementation of the questions I should match the results against in Julia implementation? (in particular - do you want to include constant term in the regression and what should be returned from the operation) Thank you! |
That hasn't been settled yet. I need to go through stackoverflow to find common questions about it. Actually, it would be helpful if you could propose one that you believe is the common problem users are looking to solve with rolling regression. |
Incidentally, just today I saw an example from a user. The user wanted to run
(so both intercept and slope are kept in a single entry) |
What I did for duckdb q10 for now is r^2, because this is what we used in groupby q9 |
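As one possible reading of "r^2 over a rolling window", here is a minimal Python sketch. It assumes a simple linear regression of y on x with an intercept, for which r^2 equals the squared Pearson correlation. The q10 design is not settled in this thread, so the names `r_squared` and `rolling_r2`, the `None` result for incomplete or degenerate windows, and the choice of returned value are all assumptions, not the reference implementation.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation of xs and ys.

    For a simple linear regression with an intercept this equals r^2.
    Returns None when either variable is constant within the window.
    """
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None
    return (sxy * sxy) / (sxx * syy)

def rolling_r2(xs, ys, w):
    """r^2 over each complete window of w rows; None for incomplete windows."""
    out = []
    for i in range(len(xs)):
        if i + 1 < w:
            out.append(None)
        else:
            lo = i - w + 1
            out.append(r_squared(xs[lo:i + 1], ys[lo:i + 1]))
    return out

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2 * x + 1 for x in xs]     # perfectly linear, so r^2 is 1 per window
print(rolling_r2(xs, ys, 3))
```

This version recomputes every window from scratch, which is easy to check against other solutions but is not how a tuned implementation would do it.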
duckdb, spark, pandas q8 q9 only: these do not have an option for properly handling an incomplete rolling window. Timings for those solutions will not include the postprocessing required to match exactly the same result (NULL vs. a value computed from an unexpected window size), as the overhead would be too big. |
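The difference between the two behaviours can be illustrated with a small Python sketch. The function `rolling_mean` and its `partial` flag are hypothetical and do not correspond to any option in duckdb, spark, or pandas. With `partial=False` an incomplete window yields `None`, while with `partial=True` it yields a value computed from fewer rows than requested.

```python
def rolling_mean(x, w, partial=False):
    """Mean over the last w values.

    partial=False: None until w values are available.
    partial=True:  use however many values exist so far.
    """
    out = []
    for i in range(len(x)):
        window = x[max(0, i - w + 1):i + 1]
        if len(window) < w and not partial:
            out.append(None)
        else:
            out.append(sum(window) / len(window))
    return out

def mask_incomplete(values, w):
    """Postprocessing step: blank out the first w - 1 results."""
    return [None if i < w - 1 else v for i, v in enumerate(values)]

x = [2.0, 4.0, 6.0, 8.0]
print(rolling_mean(x, 3))                # [None, None, 4.0, 6.0]
print(rolling_mean(x, 3, partial=True))  # [2.0, 3.0, 4.0, 6.0]
```

Here `mask_incomplete` stands in for the kind of postprocessing that, per the comment above, is left out of the timings.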
Development is possibly finished on this branch. The 5 solutions added so far have been validated using https://github.com/jangorecki/db-benchmark/blob/rollfun/_utils/rollfun-ans-validation.txt
@bkamins if you would like to add Julia, you are welcome; please use the commands from the file linked above to validate answers against one of the solutions. Once I confirm that the report is produced correctly (after running the whole rollfun bench), the PR will be ready to merge. |
@Tmonster PR is ready to merge.
To reproduce:
# install R and python
git clone https://github.com/jangorecki/db-benchmark --branch rollfun --single-branch --depth 1
cd db-benchmark
# install solutions interactively
./dplyr/setup-dplyr.sh
./datatable/setup-datatable.sh
./pandas/setup-pandas.sh
./duckdb-latest/setup-duckdb-latest.sh
./spark/setup-spark.sh
# prepare data
Rscript _data/rollfun-datagen.R 1e6 NA 0 1
Rscript _data/rollfun-datagen.R 1e7 NA 0 1
Rscript _data/rollfun-datagen.R 1e8 NA 0 1
mkdir data
mv R1_1e*_NA_0_1.csv data
vim run.conf
# do_upgrade false
# force_run true
# report false
# publish false
# run_task rollfun
# run_solution data.table dplyr pandas duckdb-latest spark
sudo swapoff -a
# workaround for #30: `=~` matching in run.sh causes "duckdb-latest" to match "duckdb"
vim run.sh
# comment out line 76 in run.sh: if [[ "$RUN_SOLUTIONS" =~ "duckdb" ]]; then ./duckdb/ver-duckdb.sh; fi;
./run.sh > ./run.out |
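The reason for the workaround can be shown outside of bash. An unanchored `=~` test against a quoted string behaves like a substring search, so "duckdb" is found inside "duckdb-latest". The Python sketch below contrasts that with an exact match on whitespace-separated names. The helper names are hypothetical and this is not a proposed patch for run.sh.

```python
run_solutions = "data.table dplyr pandas duckdb-latest spark"

def substring_match(solutions, name):
    """Mimics the unanchored match: true if name occurs anywhere."""
    return name in solutions

def exact_match(solutions, name):
    """True only if name is one of the whitespace-separated entries."""
    return name in solutions.split()

print(substring_match(run_solutions, "duckdb"))    # True, the false positive
print(exact_match(run_solutions, "duckdb"))        # False
print(exact_match(run_solutions, "duckdb-latest")) # True
```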
@Tmonster any idea if the PR can make it to master before the scheduled September run? |
draft version for rolling functions requested in #6
we can define scope for those tests here. Some are already there, others only mentioned in comments
PR roadmap:
optionally: test scripts and validate results:
rolling functions not available in: