Adding progress bar display func #951

Merged
krlmlr merged 14 commits into duckdb:main from meztez:main
Feb 14, 2025

Conversation

@meztez

@meztez meztez commented Jan 5, 2025

Contributor

For #199.

@meztez

meztez commented Jan 5, 2025

Contributor Author

It is not printing yet; looking for further insight. @krlmlr, any ideas?

@krlmlr

krlmlr commented Jan 5, 2025

Collaborator

Thanks for working on it! I think this should be an R option with a callback that is called when the option is set, see duckdb.materialize_callback for an example.
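
A minimal sketch of the option-with-callback pattern suggested here, in base R only. The option name `duckdb.progress_display` is the one used later in this thread; the helper `report_progress()` is hypothetical and only stands in for whatever the package does internally.

```r
# Hypothetical helper: look up the callback stored in an R option and,
# if one is set, call it with the current progress percentage.
report_progress <- function(pct) {
  callback <- getOption("duckdb.progress_display")
  if (is.function(callback)) {
    callback(pct)
  }
  invisible(NULL)
}

# Usage: register a callback that records every value it receives.
seen <- numeric()
old <- options(duckdb.progress_display = function(x) seen <<- c(seen, x))
for (pct in c(0, 50, 100)) report_progress(pct)
options(old)  # restore the previous option value
```

With this pattern, unsetting the option is enough to switch reporting off, since the helper does nothing when no function is registered.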

@krlmlr

krlmlr commented Jan 5, 2025

Collaborator

Or, perhaps even a slot in the duckdb_connection class?

@meztez

meztez commented Jan 5, 2025

Contributor Author

Well, I'm almost done with the callback. So I'll test that first.

@meztez meztez force-pushed the main branch 3 times, most recently from 0e03670 to b3b600e January 6, 2025 03:52
@meztez

meztez commented Jan 6, 2025

Contributor Author

@meztez

meztez commented Jan 6, 2025

Contributor Author

All right, it works, now ironing out the bugs.

@meztez

meztez commented Jan 6, 2025

Contributor Author
library(duckdb)
library(cli)

progress <- function(x) {
  if (cli::cli_progress_num() == 0) {
    cli::cli_progress_bar("Duckdb SQL", total = 100, .envir = .GlobalEnv)
  }
  cli::cli_progress_update(set = x, .envir = .GlobalEnv)
  if (x > 100) {
    cli::cli_progress_done(.envir = .GlobalEnv)
  }
}
options("duckdb.progress_display" = progress)
conn <- duckdb::dbConnect(duckdb::duckdb())
duckdb::dbSendQuery(conn, "SET progress_bar_time = 0;")
q <- "CREATE OR REPLACE TABLE BOB AS (
      SELECT * FROM 'ldbc-sf300-comments-creationDate.parquet')"
duckdb::dbSendQuery(conn, q)

Mytherin added a commit to duckdb/duckdb that referenced this pull request Jan 6, 2025
#ifndef DUCKDB_DISABLE_PRINT seems redundant since it is already used in
printer.cpp, and it prevents using a display set via
config.create_display_func when compiled with the flag
-DDUCKDB_DISABLE_PRINT, as the duckdb-r package is, where I'm trying to
implement a display.

https://github.com/duckdb/duckdb/blob/main/src/common/printer.cpp
duckdb/duckdb-r#951

PrintProgress -> TerminalProgressBarDisplay::Update ->
TerminalProgressBarDisplay::PrintProgressInternal -> Printer::RawPrint
and there is a macro there.

Plus, there is already a config option, enable_progress_bar, and its
default is FALSE.

So, can it be removed?
cc: @krlmlr
@meztez

meztez commented Jan 6, 2025

Contributor Author

I'm done on this one. Let me know if this works for you.

@e-kotov

e-kotov commented Jan 6, 2025


Testing with {spanishoddata}:

library(spanishoddata)
library(duckdb)
library(tidyverse)


x_dates <- c("2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04")
x <- spod_get(type = "od", zones = "distr", dates = x_dates)

dbGetQuery(x$src$con, "SELECT current_setting('enable_progress_bar');")
dbSendQuery(x$src$con, "SET enable_progress_bar = true;")
dbGetQuery(x$src$con, "SELECT current_setting('enable_progress_bar');")


progress <- function(x) {
  if (cli::cli_progress_num() == 0) {
    cli::cli_progress_bar("Duckdb SQL", total = 100, .envir = .GlobalEnv)
  }
  cli::cli_progress_update(set = x, .envir = .GlobalEnv)
  if (x > 100) {
    cli::cli_progress_done(.envir = .GlobalEnv)
  }
}

options("duckdb.progress_display" = progress)
duckdb::dbSendQuery(x$src$con, "SET progress_bar_time = 0;")

xx <- x |> group_by(id_origin, date, activity_origin) |> summarise(mean_trips = mean(n_trips)) |> collect()
[Screenshot 2025-01-07 at 00 31 57]

And it works!

@meztez, do we have to manually define the progress function, though? What is the final idea of this PR? I would expect that the progress bar just 'magically' appears as soon as we do:

dbGetQuery(x$src$con, "SELECT current_setting('enable_progress_bar');")

P.S. In my case I use x$src$con because spod_get returns a tbl_duckdb_connection, so you have to reach into it for the connection itself.

@meztez

meztez commented Jan 6, 2025

Contributor Author

It could provide a dummy default. It's just a function(x) called with progress percentage from within duckdb-r.

I'm not the package maintainer and I just needed it for a deliverable, so whatever works is fine by me.

@krlmlr

krlmlr commented Jan 7, 2025

Collaborator

Thanks for the PR!

Looking at the implementation, I think the callback function should be a slot in the connection object. There could be basic reporting (opt-out, in interactive mode only) in the duckdb R package, and more sophisticated progress in duckplyr.

@e-kotov

e-kotov commented Jan 7, 2025


> I'm not the package maintainer and I just needed it for a deliverable, so whatever works is fine by me.

@meztez totally makes sense. Thanks for the work in the internals to make this possible!

Looking forward to this being merged!

@HenrikBengtsson


In the above examples, (x > 100) indicates that the processing is complete. Shouldn't that be (x >= 100)? I think it's more common to consider 100% to indicate "done" than "still processing".

@meztez

meztez commented Jan 7, 2025

Contributor Author

> In the above examples, (x > 100) indicates that the processing is complete. Shouldn't that be (x >= 100)? I think it's more common to consider 100% to indicate "done" than "still processing".

progress <- function(x) {
  if (x < 100 && cli::cli_progress_num() == 0) {
    cli::cli_progress_bar("Duckdb SQL", total = 100, .envir = .GlobalEnv)
  }
  cli::cli_progress_update(set = x, .envir = .GlobalEnv)
}

options("duckdb.progress_display" = progress)
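
For readers without cli, here is a dependency-free sketch of a callback with the same shape, a `function(x)` that receives the progress percentage. It uses `utils::txtProgressBar` from base R and treats 100 as done, as suggested above. The constructor name `make_default_progress()` is hypothetical and not part of duckdb.

```r
# Hypothetical constructor for a progress callback based on base R only.
# The returned closure creates the bar lazily on first use and closes it
# once the reported percentage reaches 100.
make_default_progress <- function(file = stderr()) {
  bar <- NULL
  function(x) {
    if (is.null(bar)) {
      bar <<- utils::txtProgressBar(min = 0, max = 100, style = 3, file = file)
    }
    utils::setTxtProgressBar(bar, min(x, 100))  # clamp values above 100
    if (x >= 100) {
      close(bar)
      bar <<- NULL  # a later query starts with a fresh bar
    }
    invisible(x)
  }
}

# Usage: drive the callback by hand with a few percentages.
progress <- make_default_progress()
for (pct in c(0, 25, 50, 100)) progress(pct)
```

Such a function could be registered with `options("duckdb.progress_display" = progress)` in the same way as the cli version.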

@e-kotov

e-kotov commented Jan 9, 2025


I have done some more testing here: rOpenSpain/spanishoddata#124 (comment).

To summarize, it seems that at the moment the progress bar behavior depends on the data size and on whether you filter to a particular part of the data. That is, if you have 100GB of data and your query runs on data stored somewhere near the beginning of the file (I used the DuckDB file format), you will get some progress from 1% to 3%, and then it will just jump to 100%. Similarly, if you filter to data that is somewhere near the end of the database file, it will jump to 70% or 90% at the very beginning of the query.

So at the moment the progress bar implementation is not very informative or useful.

The question is whether this is an upstream problem (and normal behavior for DuckDB), or an artifact of how the progress bar reporting was implemented in the R package in this PR.

@meztez @krlmlr do you have any insight into whether what I'm describing is expected behavior for the progress bar, and if not, whether it can be fixed?

@meztez

meztez commented Jan 9, 2025

Contributor Author

@e-kotov Try the same thing with duckdb cli and see if you get the same behavior.
https://duckdb.org/docs/installation/index?version=stable&environment=cli&platform=win&download_method=package_manager&architecture=x86_64

I really do not see anything from the R package side that would make this behavior any different. Also this quick search in duckdb/duckdb issues : duckdb/duckdb#12454

@e-kotov

e-kotov commented Jan 9, 2025


> @e-kotov Try the same thing with duckdb cli and see if you get the same behavior.
> https://duckdb.org/docs/installation/index?version=stable&environment=cli&platform=win&download_method=package_manager&architecture=x86_64

So far I have tried with the Python module and got similar behaviour.

> Also this quick search in duckdb/duckdb issues : duckdb/duckdb#12454

I saw this one, but they said this particular one was fixed. There were a few more similar issues in the upstream repo, but nothing of the sort that I described.

I will report this upstream.

@meztez

meztez commented Jan 9, 2025

Contributor Author

> Thanks for the PR!
>
> Looking at the implementation, I think the callback function should be a slot in the connection object. There could be basic reporting (opt-out, in interactive mode only) in the duckdb R package, and more sophisticated progress in duckplyr.

@krlmlr
If it's a slot on the connection class, how would it determine in rapi_connect whether it needs to set create_display_func?
Did you mean a slot on the Driver class? I don't grok the logic you had in mind for the slot. Sorry.

@krlmlr

krlmlr commented Jan 20, 2025

Collaborator

@hannes: Can we guarantee that the progress update functions are always called from the "main" thread (the one that initiated the execution)?

@krlmlr

krlmlr commented Jan 20, 2025

Collaborator

It does look like the calls are always coming from the same thread ID, but better to be sure here.

@krlmlr krlmlr added the duckplyr 🗜️ Support for the duckplyr R package label Feb 13, 2025
@krlmlr

krlmlr commented Feb 14, 2025

Collaborator

I reduced the wait time to zero to get immediate progress.

I was wrong: custom progress callbacks per connection need duckdb/duckdb#16232, and just changing this here breaks compatibility with extensions. I'm going with the option, but I also need to check whether we are running interactively. Will add to that branch.
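
One way the "option plus interactive check" logic could look is sketched below. This is an illustration under assumptions and not the merged implementation: `resolve_progress_callback()` and its arguments are hypothetical names, and only the option name `duckdb.progress_display` comes from this thread.

```r
# Hypothetical helper: decide which progress callback to use.
# A user-supplied function in the option always wins; otherwise the default
# is used only in interactive sessions, and NULL means "no reporting".
resolve_progress_callback <- function(default, is_interactive = interactive()) {
  user <- getOption("duckdb.progress_display")
  if (is.function(user)) {
    return(user)
  }
  if (is_interactive) {
    return(default)
  }
  NULL
}

# Usage: in a non-interactive run without the option set, nothing is reported.
cb <- resolve_progress_callback(default = function(x) message(x, "%"))
if (!is.null(cb)) cb(42)
```

Passing `is_interactive` as an argument keeps the decision easy to exercise in scripts and tests, where `interactive()` is FALSE.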

@krlmlr

krlmlr commented Feb 14, 2025

Collaborator

Wonderful, thank you!

[Image: progress]

@krlmlr

krlmlr commented Feb 14, 2025

Collaborator

Tweaked:

[Image: progress]

@krlmlr

krlmlr commented Feb 14, 2025

Collaborator

The source, if anyone wants to tweak further:

cast_1 <- asciicast::record(
  as.list(quote({
    con <- DBI::dbConnect(duckdb::duckdb())
    DBI::dbExecute(con, "LOAD httpfs")
    
    url <- "https://blobs.duckdb.org/flight-data-partitioned/Year=2022/data_0.parquet"
  }))[-1],
  rows = 7,
  end_wait = 1,
  typing_speed = 0,
  show_output = TRUE
)

cast_2 <- asciicast::record(
  as.list(quote({
    sql <- paste0("SELECT COUNT(Origin) FROM read_parquet('", url, "')")
    sql
  }))[-1],
  startup = quote({
    con <- DBI::dbConnect(duckdb::duckdb())
    DBI::dbExecute(con, "LOAD httpfs")
    
    url <- "https://blobs.duckdb.org/flight-data-partitioned/Year=2022/data_0.parquet"
  }),
  show_output = TRUE
)

cast_3 <- asciicast::record(
  as.list(quote({
    DBI::dbGetQuery(con, sql)
  }))[-1],
  startup = quote({
    con <- DBI::dbConnect(duckdb::duckdb())
    DBI::dbExecute(con, "LOAD httpfs")
    
    url <- "https://blobs.duckdb.org/flight-data-partitioned/Year=2022/data_0.parquet"
    sql <- paste0("SELECT COUNT(Origin) FROM read_parquet('", url, "')")
    sql
  }),
  show_output = TRUE
)

cast <- asciicast::merge_casts(cast_1, cast_2, cast_3)
asciicast::play(cast)

asciicast::write_svg(cast, "progress.svg")

@krlmlr krlmlr merged commit 26c47bf into duckdb:main Feb 14, 2025
@krlmlr

krlmlr commented Feb 14, 2025

Collaborator

Thanks!

@meztez

meztez commented Feb 14, 2025

Contributor Author

🔥

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators Feb 15, 2026