duckdb R client uses excessive RAM and crashes #72

Open · cboettig opened this issue May 2, 2022 · 48 comments

@cboettig commented May 2, 2022

What happens?

duckdb crashes when trying to perform queries on a large (~150 GB) parquet database

To Reproduce

Apologies, I cannot reproduce this with a smaller dataset. I can reproduce the same errors using remote S3 access, though, which may be faster than downloading a full copy of the dataset, or you could probably download a copy from my server in a few days. Reproducible examples of both remote and local access are below. (The remote example goes via arrow because I cannot yet get direct access to S3 buckets via duckdb alone.)

## remote S3 access via arrow, handed to duckdb, crashes on aggregation:
library(arrow)
library(dplyr)
library(duckdb)
path <- arrow::s3_bucket("ebird/observations", endpoint_override = "minio.carlboettiger.info", anonymous=TRUE)
obs <- arrow::open_dataset(path) |> to_duckdb()
tmp <- obs |> 
  group_by(sampling_event_identifier, scientific_name) |>
  summarize(count = sum(observation_count, na.rm=TRUE),
            .groups = "drop") 
tmp <- tmp |> compute() # crashes

Or, after downloading from the above public S3 bucket (https://minio.carlboettiger.info/ebird/Mar-2022/observations), try local parquet access:

## pure local duckdb crashes on simple operations:
library(duckdb)
library(dplyr)
parquet <- file.path("/home/shared-data/ebird/observations/*.parquet")
conn <- dbConnect(duckdb(), "/home/shared-data/ebird/db")
dbExecute(conn = conn, paste0("PRAGMA memory_limit='12GB'"))
view_query <- paste0("CREATE VIEW 'observations' AS SELECT * FROM parquet_scan('",
                     parquet, "');")
DBI::dbSendQuery(conn, view_query)
obs <- tbl(conn, "observations")

tmp <- obs |> 
  group_by(sampling_event_identifier, scientific_name) |>
  summarize(count = sum(observation_count, na.rm=TRUE),
            .groups = "drop") 
tmp <- tmp |> compute()

With the memory limit PRAGMA in place, the example doesn't crash the R session but does throw an OOM error:

Error: duckdb_execute_R: Failed to run query
Error: Out of Memory Error: could not allocate block of 262144 bytes

Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • DuckDB Version: 0.3.4 (latest)
  • DuckDB Client: R

Before Submitting

  • Have you tried this on the latest master branch?
  • Python: pip install duckdb --upgrade --pre
  • R: install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)
  • Other Platforms: You can find binaries here or compile from source.

By the way, install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL) no longer works, as there no longer seems to be a master-builds download?

  • Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
@hannes (Member) commented May 2, 2022

How much RAM does the environment have? The aggregation likely runs out of memory here. We still need to add an out-of-core aggregation operator.

@cboettig (Author) commented May 2, 2022

Thanks! I'm running on a machine with 32 GB RAM
(also apologies, just noticed my S3 bucket isn't configured right, fixing that now so the above example should work again)

@rsund commented May 2, 2022

This won't fix the problem, but with DuckDB 0.3.3 or above you could simplify your connection code to local parquet files:

parquet <- file.path("/home/shared-data/ebird/observations/*.parquet")
conn <- DBI::dbConnect(duckdb(), "/home/shared-data/ebird/db", config=list("memory_limit"="12GB"))
obs <- tbl(conn, paste0("read_parquet('", parquet, "')"))

It may also be possible to avoid the crash in the arrow version by giving a memory-limited connection to to_duckdb(), since the default connection created otherwise does not set a memory limit:

path <- arrow::s3_bucket("ebird/observations", endpoint_override = "minio.carlboettiger.info")
conn <- DBI::dbConnect(duckdb(), "/home/shared-data/ebird/db", config=list("memory_limit"="12GB"))
obs <- arrow::open_dataset(path) |> to_duckdb(conn)

The download-link you gave was not (yet) working, so I couldn't test with the actual data.

FYI, in case it's useful to you, there is an unofficial DuckDB R package available in my r-universe (updated after each merge to the DuckDB master branch; it also bundles the DuckDB extensions, which can be compiled with the default R toolchain):
https://rsund.r-universe.dev/ui#package:duckdb

@cboettig (Author) commented May 2, 2022

Thanks @rsund! Thanks for sharing the above syntax; it's cool that you don't need to explicitly declare the creation of a VIEW or execute a PRAGMA.

Also, I've fixed my S3 bucket address, sorry (I forgot to include the subdirectory for the month):

library(arrow)
library(dplyr)
library(duckdb)

path <- arrow::s3_bucket("ebird/Mar-2022/observations", endpoint_override = "minio.carlboettiger.info", anonymous=TRUE)
conn <- DBI::dbConnect(duckdb(), ":memory:", config=list("memory_limit"="12GB"))
obs <- arrow::open_dataset(path) |> to_duckdb(conn)

tmp <- obs |> 
  group_by(sampling_event_identifier, scientific_name) |>
  summarize(count = sum(observation_count, na.rm=TRUE),
            .groups = "drop") 
tmp <- tmp |> compute() # crashes

Still crashing though, I suspect because of what @hannes pointed out already about the aggregation. Is there an existing issue thread I should watch for that? 👀

@rsund commented May 3, 2022

Well, you should not use a :memory: database in this case unless you also define a temp_directory:

conn <- DBI::dbConnect(duckdb(), ":memory:", config=list("memory_limit"="12GB", "temp_directory" = "/tmp"))

or use a file-based database:

conn <- DBI::dbConnect(duckdb(), "database.duckdb", config=list("memory_limit"="12GB"))

It is also unclear to me how Arrow handles memory allocation for online data sources; when queries over those are run by DuckDB, there may in the worst case be a need to hold the whole required columns (temporarily) in memory. In principle that should not be the case, as data is scanned in chunks, and the excess memory consumption should relate only to the aggregate step, which is not yet tuned for using only limited memory.

In this particular case it should be quite straightforward to "manually" run the aggregate query in chunks if there is some sensible variable to use for splitting the data into pieces. But it is certainly more work, and shouldn't be needed here, as both Arrow and DuckDB are expected to take care of larger-than-memory data.
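
As a rough, untested sketch of that manual chunking (assuming, hypothetically, that the first letter of scientific_name is an acceptable splitting variable), something like this keeps each aggregate small; because the filter partitions the grouping keys, the per-chunk results can simply be row-bound:

# Rough sketch only (untested): chunk the aggregate by the first letter of
# scientific_name (hypothetical splitting variable; assumes genus names are
# capitalized A-Z). Each (sampling_event_identifier, scientific_name) group
# falls entirely inside one chunk, so the results can simply be row-bound.
library(duckdb)
library(dplyr)
library(purrr)

conn <- DBI::dbConnect(duckdb(), "chunked.duckdb",
                       config = list("memory_limit" = "12GB"))
parquet <- "/home/shared-data/ebird/observations/*.parquet"
obs <- tbl(conn, paste0("read_parquet('", parquet, "')"))

chunks <- map(LETTERS, function(first_letter) {
  obs |>
    filter(substr(scientific_name, 1L, 1L) == !!first_letter) |>
    group_by(sampling_event_identifier, scientific_name) |>
    summarize(count = sum(observation_count, na.rm = TRUE), .groups = "drop") |>
    collect()
})
result <- bind_rows(chunks)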

About the "new" features in R client: config argument has existed there a long time already, but it was just not documented and all examples used PRAGMA to change the setup. The support for registering parquet and other files / functions directly in dplyr::tbl() is, however, a new feature. With DuckDB 0.3.3 also the DuckDB-tailored backend for dbplyr was released and now e.g. the tidyr::fill() works in queries as expected.

@cboettig (Author) commented May 3, 2022

Thanks @rsund , these details are super helpful to me.

Re using :memory:, I didn't realize that a temp directory is not set by default. Is there a good reason that the R client does not automatically default to the R tempdir for this? (Unfortunately, in this case the high memory use occurs even with a file-based database, as in my previous example.)
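
In the meantime, a sketch of the explicit workaround is simply to point the spill directory at R's per-session tempdir when connecting:

# Sketch: explicitly point DuckDB's spill directory at R's tempdir when
# using an in-memory database.
library(duckdb)
con <- DBI::dbConnect(
  duckdb(), ":memory:",
  config = list("memory_limit"   = "12GB",
                "temp_directory" = file.path(tempdir(), "duckdb"))
)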

Yeah, it's also unclear to me how arrow handles memory allocations in this context. FWIW, as you probably know the same queries are possible in 'pure arrow' without the use of duckdb, but result in similar OOM crashes there, though perhaps for quite different reasons. (e.g. https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531027#comment-17531027). I'm a huge fan of these packages but like you say, being able to seamlessly work with larger-than-RAM data is really the killer feature here.

Thanks also for the heads up that the tidyr::fill() issue has been addressed, that's awesome. I'm closing duckdb/duckdb#2372 then!

@hannes (Member) commented May 4, 2022

Arrow runs into the same problem. For DuckDB, out-of-core capability is absolutely one of our main goals, it's just not supported yet in every operator. The suggestion to use the R tempdir automatically is interesting.

@hannes (Member) commented May 4, 2022

@lnkuiper if you needed any reason to look at the aggregates again ^^

@1beb commented May 19, 2022

I think one approach could be to summarize over some partition, then aggregate, at least for simple counts. Often arrow files are partitioned by something. However, arrow doesn't have an intrinsic chunking method (by row number) that I'm aware of beyond the splitting of parquet files. duckdb offers traditional window functions, so perhaps specifying a chunk size (n lines), windowing, and then mapping over the windows with a final summarization step would be appropriate. I don't know whether this belongs inside the duckdb R package or rather as an ad-hoc piece of code; a rough sketch of the ad-hoc version follows below.
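
To make that concrete, an ad-hoc sketch (untested; paths are placeholders) that aggregates each parquet file separately and then re-aggregates the partial sums, since a group can straddle several files:

# Ad-hoc sketch (untested; paths are placeholders): aggregate each parquet
# file separately, then re-aggregate the partial sums because a group can
# straddle several files.
library(duckdb)
library(dplyr)
library(purrr)

con <- DBI::dbConnect(duckdb(), "partial.duckdb")
files <- Sys.glob("/home/shared-data/ebird/observations/*.parquet")

partials <- map(files, function(f) {
  tbl(con, paste0("read_parquet('", f, "')")) |>
    group_by(sampling_event_identifier, scientific_name) |>
    summarize(count = sum(observation_count, na.rm = TRUE), .groups = "drop") |>
    collect()
})

result <- bind_rows(partials) |>
  group_by(sampling_event_identifier, scientific_name) |>
  summarize(count = sum(count), .groups = "drop")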

@cboettig (Author)

@hannes et al., any news on out-of-core capability in aggregates?

In other news, I can at least now reproduce the crash in pure SQL, without the arrow wrapper for the S3 bit 😉:

INSTALL httpfs;
LOAD httpfs;
SET s3_endpoint='minio.carlboettiger.info';
SET s3_url_style='path';

COPY (
SELECT *, count > 0.0 AS species_detected
FROM (
  SELECT
    sampling_event_identifier,
    scientific_name,
    SUM(observation_count) AS count
  FROM (
    SELECT sampling_event_identifier, scientific_name, observation_count
    FROM read_parquet("s3://ebird/observations/*")
  )
  GROUP BY sampling_event_identifier, scientific_name
)) TO 'output.parquet' (FORMAT PARQUET);

This is a public bucket, so if I use the https:// addressing and limit to something considerably less than all 1124 partitions in the dataset, it doesn't crash. As such, I can probably hack the query together by working over partitions in chunks manually, but that seems less than ideal...
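
The manual hack would look roughly like this (sketch only; it assumes the partitions follow a part-N.parquet naming scheme, and only covers the partial aggregation of a handful of files):

# Sketch of the manual chunking hack over https (assumes part-N.parquet naming).
library(duckdb)
con <- DBI::dbConnect(duckdb(), "subset.duckdb")
DBI::dbExecute(con, "INSTALL httpfs;")
DBI::dbExecute(con, "LOAD httpfs;")

base  <- "https://minio.carlboettiger.info/ebird/observations"
files <- paste0(base, "/part-", 0:9, ".parquet")
file_list <- paste0("['", paste(files, collapse = "', '"), "']")

DBI::dbExecute(con, paste0(
  "CREATE TABLE partial AS
     SELECT sampling_event_identifier, scientific_name,
            SUM(observation_count) AS count
     FROM read_parquet(", file_list, ")
     GROUP BY sampling_event_identifier, scientific_name;"))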

@cboettig (Author) commented Jul 31, 2023

Any news on out-of-core capability in this case, or on the ability to leverage a temp dir to avoid OOM kills on these operations? (As you already know, the above operations still lead to the OOM killer even in pure SQL.)

@Tmonster (Contributor) commented Sep 6, 2023

@cboettig Not sure, but potentially this PR will address your issue?
duckdb/duckdb#8475

@cboettig (Author)

@Tmonster Thanks for flagging me! I just tested the examples above again after installing from GitHub, but sadly duckdb still just consumes all available RAM and crashes.

@hannes (Member) commented Sep 15, 2023

Maybe also something where @lnkuiper could have a look - can we make a pure SQL repro for this?

@Tmonster (Contributor)

Pure SQL repro here: #72

@cboettig do you think you could make the dataset publicly accessible again?

@cboettig (Author)

I think the pure SQL example above should still be hitting a public-read bucket.

INSTALL httpfs;
LOAD httpfs;
SET s3_endpoint='minio.carlboettiger.info';
SET s3_url_style='path';

COPY (
SELECT *, count > 0.0 AS species_detected
FROM (
  SELECT
    sampling_event_identifier,
    scientific_name,
    SUM(observation_count) AS count
  FROM (
    SELECT sampling_event_identifier, scientific_name, observation_count
    FROM read_parquet("s3://ebird/observations/*")
  )
  GROUP BY sampling_event_identifier, scientific_name
)) TO 'output.parquet' (FORMAT PARQUET);

lemme know if that's not working for you?

@Tmonster (Contributor)

Hi, I've been able to reproduce it. It does seem to be a bug. I'm working on narrowing down the exact issue, but a current workaround would be to:

  1. First create a local copy of the data
  2. Query the data into a local table
  3. Disable preserve_insertion_order
  4. Copy the answer to parquet

create table observations as select * from read_parquet("s3://ebird/observations/*");
create table answer as SELECT *, count > 0.0 AS species_detected
FROM (
  SELECT
    sampling_event_identifier,
    scientific_name,
    SUM(observation_count) AS count
  FROM (
    SELECT sampling_event_identifier, scientific_name, observation_count
    FROM observations
  )
  GROUP BY sampling_event_identifier, scientific_name
);
SET preserve_insertion_order=false;
copy answer to 'output.parquet' (FORMAT PARQUET);

If you have a persistent database or set a temporary directory for an in-memory database, this should work. It worked on my laptop with 16 GB.
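
From R, the same workaround could be driven through DBI along these lines (sketch only, not verified from the R client):

# Sketch: the same workaround via DBI, using a persistent database file so
# that DuckDB can spill to disk.
library(duckdb)
con <- DBI::dbConnect(duckdb(), dbdir = "ebird.duckdb")

DBI::dbExecute(con, "INSTALL httpfs;")
DBI::dbExecute(con, "LOAD httpfs;")
DBI::dbExecute(con, "SET s3_endpoint='minio.carlboettiger.info';")
DBI::dbExecute(con, "SET s3_url_style='path';")

DBI::dbExecute(con, "CREATE TABLE observations AS
  SELECT * FROM read_parquet('s3://ebird/observations/*');")
DBI::dbExecute(con, "CREATE TABLE answer AS
  SELECT *, count > 0.0 AS species_detected
  FROM (
    SELECT sampling_event_identifier, scientific_name,
           SUM(observation_count) AS count
    FROM observations
    GROUP BY sampling_event_identifier, scientific_name
  );")
DBI::dbExecute(con, "SET preserve_insertion_order=false;")
DBI::dbExecute(con, "COPY answer TO 'output.parquet' (FORMAT PARQUET);")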

@cboettig (Author) commented Nov 1, 2023

Thanks much for this @Tmonster , it's great to be able to work around the issue this way.

I'm still keen on having a way for users to execute this remotely without having to create the local copy, but it's really nice for this to be unstuck. Please share if you have any further luck in pinpointing the exact issue that causes this query to OOM.

@cboettig (Author) commented Nov 1, 2023

@Tmonster maybe I spoke too soon? This is still hitting an OOM error on my machine with 48 GB RAM. Maybe this is a secondary issue with how back-pressure is implemented, since I'm running inside a container capped at 48 GB on a host machine with 64 GB RAM. I am setting both a tempdir path PRAGMA and a max memory limit PRAGMA.

Running on Ubuntu 22.04 on amd64. (Here's the R version, though note it's just passing the SQL as above; I don't think R is really involved in the issue.)

library(duckdb)
con <- dbConnect(duckdb())

pragmas <- paste0(
"PRAGMA memory_limit='32GB';
 PRAGMA temp_directory='", tempdir(), "';")
dbExecute(conn = con, pragmas)


query <-
"
INSTALL httpfs;
LOAD httpfs;
SET s3_endpoint='minio.carlboettiger.info';
SET s3_url_style='path';

create table observations as select * from read_parquet('s3://ebird/observations/*');
create table answer as SELECT *, count > 0.0 AS species_detected
FROM (
  SELECT
  sampling_event_identifier,
  scientific_name,
  SUM(observation_count) AS count
  FROM (
    SELECT sampling_event_identifier, scientific_name, observation_count
    FROM observations
  )
  GROUP BY sampling_event_identifier, scientific_name
);
SET preserve_insertion_order=false;
copy answer to 'ebird.parquet' (FORMAT PARQUET);
"

dbSendQuery(con, query)

@Tmonster (Contributor) commented Nov 1, 2023

Hmmm, I'm not positive, but there may be a difference between setting up a tempdir and connecting to a persistent database file.

Can you try replacing the second line with con <- dbConnect(duckdb(), dbdir='test.db')?
If you do that you won't need to establish a temp directory.

@cboettig (Author) commented Nov 2, 2023

Good idea, but no luck there; it still consumes all available RAM (even on my larger machine with 120 GB) and crashes.

@Tishj commented Nov 2, 2023

> good idea but no luck there, still consumes all available RAM (even on my larger machine with 120 GB) and crashes

Just gathering some more potentially useful info:
How many cores does the machine have, and how many does duckdb detect? (select current_setting('threads');)

Also something worth trying to further troubleshoot the issue: can you reproduce it using the CLI, since you mentioned it is just raw SQL?
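
From R that check is just (sketch):

# Check how many threads DuckDB detected, and optionally pin it to 1.
library(duckdb)
con <- DBI::dbConnect(duckdb())
DBI::dbGetQuery(con, "SELECT current_setting('threads') AS threads;")
DBI::dbExecute(con, "PRAGMA threads=1;")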

@cboettig (Author) commented Nov 3, 2023

Thanks -- great ideas here. I did try running in the CLI, and setting threads to 1 with PRAGMA threads=1; (the test machines have 24 and 128 threads, respectively), but unfortunately this still crashes OOM for me. Hmmmm! 🤔

@SimonCoulombe (comment marked as off-topic)

@Tmonster (Contributor) commented Feb 9, 2024

Hi Simon,

I think what you are seeing here is the effect of DuckDB materializing the data from the table, allocating the R data frame, then copying the data into R.

DuckDB doesn't know how much space it should allocate until it has all of the data, so first all the data needs to be materialized in DuckDB's memory (~16 GB). Then DuckDB needs to allocate an R data frame to hold all of it (also ~16 GB) so that R can use the data. Once DuckDB passes off the data, it lives only in R, which is why you see 16 GB of memory in use again.

Interesting that DuckDB doesn't provide any feedback that the memory limit has been reached in this case. That is something I can look into.
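
One thing that can at least keep the R-side peak down is streaming the result in chunks with dbSendQuery()/dbFetch() rather than collecting one big data frame. A sketch (the table name is a placeholder); note this does not change how much DuckDB itself materializes:

# Sketch: fetch a large result into R in chunks; DuckDB still materializes
# the result on its side, but R never holds more than one chunk at a time.
library(duckdb)
con <- DBI::dbConnect(duckdb(), dbdir = "big.duckdb")

res <- DBI::dbSendQuery(con, "SELECT * FROM big_table;")  # placeholder table
while (!DBI::dbHasCompleted(res)) {
  chunk <- DBI::dbFetch(res, n = 1e6)  # one million rows at a time
  # ... process or write out `chunk`, then let it be garbage collected
}
DBI::dbClearResult(res)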

@SimonCoulombe commented Feb 9, 2024

> I think what you are seeing here is the effect of DuckDB materializing the data from the table, allocating the R data frame, then copying the data into R. [...]

Thanks!
I think I understand the process you describe. I'm just not sure what "feedback" means in this situation, or what the expected behaviour is. Is it:

  1. DuckDB is expected to need the data in both DuckDB's memory (16 GB) and R's memory (another 16 GB), and feedback in this context means telling the user that the memory_limit can't be respected?

  2. DuckDB is expected to respect the memory limit on its side (memory_limit = 1GB), so that total RAM use should be 1 GB (duckdb) + 16 GB (R)? In this case, I'm not sure what feedback means.

@szarnyasg transferred this issue from duckdb/duckdb Feb 9, 2024
@Tmonster (Contributor)

Sorry, my mention of feedback wasn't clear.

To clarify, feedback from DuckDB would be some kind of notification to the user that DuckDB requires more memory than what is specified by the memory_limit.

Normally if duckdb attempts to use more memory than specified by the memory limit, an Out of memory error is thrown. So I am curious as to why we don't see something like that here.

We have an internal issue to track this

@SimonCoulombe commented Feb 12, 2024

Gotcha, thanks again.

Just to make sure: there is no way to read/write data from a duckdb database to R without, at some point, using twice as much RAM as the full data size, correct?

@krlmlr (comment marked as outdated)

@krlmlr (comment marked as off-topic)

@SimonCoulombe (comment marked as off-topic)

@cboettig (Author)

@krlmlr still an issue, though it may be a different one from @SimonCoulombe's. I think my example above is a symptom of not all the aggregation methods being implemented out-of-core yet? As noted above, I believe my example still crashes out of RAM when executed as pure SQL, without R involved.

@krlmlr (Collaborator) commented Feb 24, 2024

Thanks!

@cboettig: If this can be replicated with the command-line client, an issue in the upstream repo https://github.com/duckdb/duckdb/ would be useful.

@SimonCoulombe: Is your issue still a problem with the latest dev version? You can also install from r-universe: https://duckdb.r-universe.dev/duckdb# . If yes, let's discuss this in a new issue.

@cboettig (Author)

@krlmlr thanks, it's been a while since I checked this in the CLI client. I just tested again with the identical code I provided above and it works just fine in the CLI, so it is indeed now just an R issue at this point.

I'm not using DBI writeTable explicitly above though, so I don't know if this is or isn't the same issue as @SimonCoulombe is reporting.

Again, here's the code that works in the CLI but fails when called via R. In the CLI it uses just above 30 GB (RES use peaks around 35.7 GB; VIRT use is higher, but I understand that doesn't matter so much).

PRAGMA memory_limit='30GB';
PRAGMA temp_directory='/tmp/duckdb';

INSTALL httpfs;
LOAD httpfs;
SET s3_endpoint='minio.carlboettiger.info';
SET s3_url_style='path';

-- try view instead
create table observations as select * from read_parquet('s3://ebird/observations/*');

create table answer as SELECT *, count > 0.0 AS species_detected
FROM (
  SELECT
  sampling_event_identifier,
  scientific_name,
  SUM(observation_count) AS count
  FROM (
    SELECT sampling_event_identifier, scientific_name, observation_count
    FROM observations
  )
  GROUP BY sampling_event_identifier, scientific_name
);
SET preserve_insertion_order=false;
copy answer to 'ebird.parquet' (FORMAT PARQUET);

@krlmlr (Collaborator) commented Feb 27, 2024

Thanks. @cboettig, @SimonCoulombe: Is this still a problem with #90, which contains duckdb 0.10.0?

@SimonCoulombe commented Feb 27, 2024

> Is this still a problem with #90, which contains duckdb 0.10.0?

On it -- let me check.
(Installing using remotes::install_github(repo = "duckdb/duckdb-r", ref = remotes::github_pull(90)).)

@SimonCoulombe (comment marked as off-topic)

@SimonCoulombe (comment marked as off-topic)

@krlmlr (comment marked as off-topic)

@krlmlr (Collaborator) commented Mar 3, 2024

@cboettig: From looking at your example, I see no reason why it shouldn't work with the current dev version, soon to hit CRAN. That version contains exactly the same core as the v0.10.0 release.

I have split your script into an importing and a processing stage; the importing stage is now at about 2 GB with the progress still at 0%. I wonder how much data this example contains. More important, though, is the question whether we can downsize the example so that it still shows the problem without having to wait that long.

@cboettig (Author) commented Mar 3, 2024

Thanks much @krlmlr, really appreciate this. Yes, this is a large example (easy to check using S3, e.g.:

> library(minioclient)
> mc_alias_set("ex", "minio.carlboettiger.info", "", "")
Added `ex` successfully.

> mc_du("ex/ebird/observations/")
152GiB	1124 objects	ebird/observations

I'd love to find a smaller example that illustrates the problem. You could certainly run the command on any one or any subset of the 1124 parquet files (mc_ls("ex/ebird/observations")) listed there, but I'm not sure how to replicate the problem that way -- when I try a really small subset (e.g. just part-0.parquet, say) it works just fine in about a second, even if I set something like PRAGMA memory_limit='1GB';. All the really small examples work great; the problem is just that the large, real-world examples crash in R. I'm not clever enough to see how to reproduce that behavior with a small example.

@krlmlr (Collaborator) commented Mar 3, 2024

Thanks. The memory limit can be as small as 10 MB or a small multiple, that's already an awful lot of memory blocks that the database engine can shuffle around. True, some operations don't work at all if the memory limit is too aggressive, but perhaps there's a sweet spot?

If you still have a duckdb file with the data, could you try running the query with the most recent duckdb package from r-universe? Chances are it'll just work.

On a side note, I can't install the MinIO client on my macOS M1. Known issue?

@cboettig (comment marked as off-topic)

@krlmlr (comment marked as off-topic)

@krlmlr (Collaborator) commented Mar 11, 2024

I stand corrected -- a 10 MB memory limit doesn't seem to be a good simulation of reality; perhaps 125 MB per thread is more realistic?
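
From the R client that sizing would look roughly like this (sketch; 125 MB per thread is just the rule of thumb above):

# Sketch: size the memory limit at roughly 125 MB per detected thread.
library(duckdb)
con <- DBI::dbConnect(duckdb())
n_threads <- as.integer(DBI::dbGetQuery(con, "SELECT current_setting('threads') AS n;")$n)
DBI::dbExecute(con, sprintf("SET memory_limit='%dMB';", 125L * n_threads))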

@krlmlr modified the milestones: 0.10.0, 0.10.1 Mar 11, 2024
@cboettig (Author)

Still testing with this. I'm not entirely clear on the difference between these options:

SET memory_limit = '1GB';
SET max_memory = '1GB';

Are these just aliases for the same setting? Also, from https://duckdb.org/docs/configuration/pragmas#memory-limit, it sounds like the limit isn't enforced for certain aggregation operations:

The specified memory limit is only applied to the buffer manager. For most queries, the buffer manager handles the majority of the data processed. However, certain in-memory data structures such as vectors and query results are allocated outside of the buffer manager. Additionally, aggregate functions with complex state (e.g., list, mode, quantile, string_agg, and approx functions) use memory outside of the buffer manager. Therefore, the actual memory consumption can be higher than the specified memory limit.

though I need to look more closely to understand if/when I encounter any of those cases.
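
For my own notes, a quick way to inspect what DuckDB actually reports from R (a sketch; my understanding is that the two names point at the same setting, but checking is cheap):

# Set the limit one way, then list the memory-related settings DuckDB reports.
library(duckdb)
con <- DBI::dbConnect(duckdb())
DBI::dbExecute(con, "SET memory_limit='1GB';")
DBI::dbGetQuery(con, "SELECT name, value FROM duckdb_settings() WHERE name LIKE '%memory%';")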
