
Use cpp11 for rpkg #2974

Merged · 40 commits · merged into duckdb:master on Jan 26, 2022

Conversation

@nbenn (Contributor) commented on Jan 22, 2022

This supersedes PR #2948 (which I will close in favor of this). In addition to #2948, this

  • fixes a compiler warning that snuck into #2948 (Use cpp11 package)
  • adds some infrastructure for performance regression testing of the R package

Currently, this generates data (2 integer columns, 1 double column, 1 string column and 1 factor column) with {1e3, 1e5, 1e7} rows, and the following benchmarks are run:

  • write_df: data is written using dbWriteTable() and removed again with dbRemoveTable()
  • register_df: data is registered using duckdb_register() and unregistered with duckdb_unregister()
  • register_arrow: an arrow InMemoryDataset is registered using duckdb_register_arrow() and unregistered with duckdb_unregister_arrow()
  • read_from_tbl: a native table is read using dbReadTable()
  • select_from_tbl: a row subset of a native table is read using dbGetQuery()
  • read_from_df: dbReadTable() is used to read a registered data.frame
  • select_from_df: a row subset is read from a registered data.frame
  • read_from_arw: dbReadTable() is used to read a registered arrow table
  • select_from_arw: a row subset is read from a registered arrow table

Row subsetting is done as SELECT * FROM table WHERE column > 50, where column is generated as U(0, 100). Benchmarks are repeated 50 times. A sketch of the setup is shown below.
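For reference, this is a minimal sketch of what such a setup could look like with DBI, duckdb and bench; the column names, the make_data() helper and the two benchmarks shown are illustrative, not the exact code added in this PR:

```r
library(DBI)
library(duckdb)

# Illustrative data generator: 2 integer, 1 double, 1 string and 1 factor column,
# with the numeric columns drawn from U(0, 100).
make_data <- function(n) {
  data.frame(
    int1 = sample.int(100L, n, replace = TRUE),
    int2 = sample.int(100L, n, replace = TRUE),
    dbl  = runif(n, 0, 100),
    chr  = sample(letters, n, replace = TRUE),
    fct  = factor(sample(LETTERS[1:5], n, replace = TRUE)),
    stringsAsFactors = FALSE
  )
}

con <- dbConnect(duckdb())
df  <- make_data(1e5)

res <- bench::mark(
  write_df = {
    dbWriteTable(con, "tbl", df)
    dbRemoveTable(con, "tbl")
  },
  register_df = {
    duckdb_register(con, "df", df)
    duckdb_unregister(con, "df")
  },
  iterations = 50,
  check = FALSE  # the two expressions do not return comparable values
)

dbDisconnect(con, shutdown = TRUE)
```

The read/select variants follow the same pattern, running dbReadTable() or the filtering dbGetQuery() against a persisted table, a registered data.frame or a registered arrow dataset.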

@hannesmuehleisen I'm happy to add (or remove) benchmarks if there are relevant aspects that are not yet covered. This is intended as a starting point.

I have attached results of such a benchmark, comparing this (cpp11) to both master and v0.3.1 (release). In the spirit of

# If there is a diference of 10% in regression on any query the build breaks.

red indicates slower and green faster than master +/- 10% (comparing median runtimes). If the threshold is set to 15%, as actually seems to be the case in

regression_test(0.15,benchmark)

failures only remain in the nrows = 1000 setting. We could of course automate some of this in CI, e.g. for a 1e7-row setting, and fail if slower than master + 15%, similarly to what is done in regression_test.py.
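For illustration, a hypothetical helper (not the code in regression_test.py) that applies such a median-runtime threshold on the R side; it assumes two data frames with a name and a median column, e.g. summarised from bench::mark() results for master and the candidate branch:

```r
# Hypothetical threshold check: flag every benchmark whose median runtime on the
# candidate branch exceeds the corresponding master median by more than `threshold`.
flag_regressions <- function(master, candidate, threshold = 0.15) {
  stopifnot(identical(master$name, candidate$name))
  ratio <- as.numeric(candidate$median) / as.numeric(master$median)
  data.frame(
    name      = master$name,
    ratio     = ratio,
    regressed = ratio > 1 + threshold
  )
}
```

CI could then fail the build whenever any(flag_regressions(master, candidate)$regressed) is TRUE.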

(attached plot: perf_reg — benchmark comparison of cpp11 against master and v0.3.1)

@hannes (Member) commented on Jan 23, 2022

Thanks for the benchmark, what was the GC level again?

@nbenn (Contributor, Author) commented on Jan 24, 2022

R uses a generational garbage collector, which divides allocated nodes into generations based on some notion of age. Younger generations are collected more frequently than older ones. From R-ints:

There are three levels of collections. Level 0 collects only the youngest generation, level 1 collects the two youngest generations and level 2 collects all generations. After 20 level-0 collections the next collection is at level 1, and after 5 level-1 collections at level 2. Additionally, if a level-n collection fails to provide 20% free space (for each of nodes and the vector heap), the next collection will be at level n+1. The R-level function gc() performs a level-2 collection.

The people behind the benchmarking package bench, which is used here, decided that it makes the most sense to filter out runs during which garbage collections occurred, since such runs are less directly comparable. So for the slower/same/faster comparison, only the red dots are actually taken into account. Personally, I don't find it obvious that this is always a good idea. After all, if you're comparing an implementation that triggers GC more frequently, you'd want that penalized in some form. We do have the number of GC invocations per run available (together with other things, such as total allocated memory), and if you're interested in seeing this as well, I'm happy to incorporate that information somehow, or to share the raw data.
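To make the GC point concrete, here is a small sketch of how GC activity shows up in a bench::mark() result; the benchmarked expression is a placeholder, and filter_gc controls whether iterations that triggered a collection are dropped from the summary statistics:

```r
library(bench)

res <- bench::mark(
  alloc_heavy = { x <- replicate(100, rnorm(1e4)); NULL },
  iterations = 50,
  filter_gc = FALSE  # keep iterations with GC in the summary statistics
)

# n_gc is the total number of garbage collections across all iterations;
# mem_alloc is the memory allocated while running the expression.
res[, c("expression", "median", "n_itr", "n_gc", "mem_alloc")]
```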

As a side remark: results are not super stable. I just reran and this time, for both thresholds of 0.1 and 0.15, "slower" cases are only present in the nrow: 1000 setting.

@nbenn requested a review from @hannes on Jan 26, 2022, 07:56
@hannes (Member) commented on Jan 26, 2022

Can you please merge with master? That should fix the failing test.

@nbenn (Contributor, Author) commented on Jan 26, 2022

@hannesmuehleisen Thanks!

Can you please merge with master? That should fix the failing test.

If this was directed at me: GitHub tells me that I'm not authorized to merge the PR.

@hannes (Member) commented on Jan 26, 2022

@nbenn no, I mean merge with the latest upstream master and commit the merge to this branch.

@hannes merged commit c8fc684 into duckdb:master on Jan 26, 2022