Copying the data frame to disk #89

Closed · multimeric opened this issue Dec 1, 2023 · 8 comments · Fixed by #128

@multimeric

My impression at the moment is that if you use as_duckplyr_df(), then duckdb is actually processing the data frame in R's allocated memory. This may or may not be correct.

What I want to do is first save the data frame to disk in an efficient format (hopefully the DuckDB binary format, but parquet would work as well), and then re-open the data frame for querying on disk. Is there currently a way to do this, or a plan for how to do this with duckplyr?

@krlmlr (Collaborator) commented Dec 1, 2023

Do we need to add more to our README? Happy to review a PR!


See ?duckplyr_df_from_file; this keeps the file on disk. The object returned is a data frame, but the data is read only when you request column values or the number of rows. You can try it with a large Parquet file: duckplyr_df_from_file() returns instantly, but querying the data for the first time takes time. (In the RStudio IDE, you may need to upgrade to the latest version or set "Manual refresh only" in the Environment pane.)
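
For instance, a minimal sketch of that workflow (the file and column names here are placeholders, not from the original post):

library(duckplyr)

# Register the Parquet file; this returns a data frame whose contents stay
# on disk until they are actually requested.
flights <- duckplyr_df_from_file("flights.parquet", "read_parquet")

# Only now is the file scanned, because values are being asked for.
nrow(flights)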


When such a data frame is processed with dplyr, the processing happens in DuckDB and uses DuckDB memory. Only when a result is collected (either manually or because the code requests an operation that can't yet be run in DuckDB) is the data materialized as a data frame.

Materialization also emits a message by default, see ?config.
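
To illustrate, continuing the sketch above (column names are placeholders, and whether each verb stays in DuckDB depends on duckplyr's current coverage):

library(duckplyr)
library(dplyr)

out <- flights |>
  filter(dep_delay > 60) |>
  mutate(delay_hours = dep_delay / 60) |>
  select(origin, delay_hours)

# Up to here everything runs as a DuckDB relation in DuckDB memory.
# Reading values from the result materializes it as an ordinary data
# frame and, by default, prints a message; see ?config.
head(out)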

@multimeric (Author)

Thanks for the reply. I understood that duckplyr_df_from_file reads from disk, and I think the docs for that are fine. My use case was taking a standard in-memory R data frame, persisting it to disk, and then re-opening it as a duckplyr data frame. I guess this is a bit of a rare case, though, because if the data frame already fits in memory, there's little reason to put it on disk. Generally speaking, I'm looking for a save implementation such as #87, but supporting the DuckDB binary format as well as Parquet.

@krlmlr (Collaborator) commented Dec 1, 2023

Parquet is the safest option for now. The DuckDB database that duckplyr uses is currently an opaque implementation detail, and supporting export to a different connection/database file is not straightforward.
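
As a rough sketch of the Parquet route (using the arrow package to write the file; names are illustrative):

library(arrow)
library(duckplyr)

# Persist the in-memory data frame as a Parquet file...
write_parquet(my_df, "my_df.parquet")

# ...then reopen it lazily, so subsequent dplyr verbs run in DuckDB
# against the file instead of against R memory.
my_df_lazy <- duckplyr_df_from_file("my_df.parquet", "read_parquet")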

@Tmonster: please correct me if I'm misrepresenting.

@Tmonster (Contributor) commented Dec 1, 2023

@krlmlr you are right, although creating a DuckDB database to store the data frame contents should be fairly straightforward:

library(duckdb)

# Open (or create) a DuckDB database file on disk and copy the data frame into it
persistent_con <- dbConnect(duckdb(), dbdir = "persisted_table.db")
dbWriteTable(persistent_con, "my_persistent_table", my_df)

I haven't tested this, though, so I'm unsure if it really is this easy. The file persisted_table.db can then be opened with DuckDB in any other client (CLI, Python, etc.). You can see more in the R API docs.
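
As a further (equally untested) sketch, the persisted file can be reopened in a later session and queried without the original data frame:

library(duckdb)

# Reopen the database file written above and query the persisted table
con <- dbConnect(duckdb(), dbdir = "persisted_table.db")
dbGetQuery(con, "SELECT COUNT(*) FROM my_persistent_table")
dbDisconnect(con, shutdown = TRUE)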

I'm also working on a to_parquet function for duckplyr.

@krlmlr (Collaborator) commented Dec 2, 2023

How can we pour the results of a query specified by a relational object into a new table (possibly in a different database connection)? Do we need rel_to_table()?

@Tmonster (Contributor) commented Dec 4, 2023

I think in this case we would need a rel_to_table() function. I imagine this is for the case where a df cannot fit in memory?

If not, we could also convert the relation object to a normal df and then use dbWriteTable. If we do that, you can write to different databases in the following (slightly hacky) way.

library(duckdb)

con <- dbConnect(duckdb(), dbdir = "db1.db")

# Attach a second database file, then switch the default database
# before each write so the table lands in the intended file.
dbExecute(con, "ATTACH 'db2.db'")
dbExecute(con, "USE db2;")
dbWriteTable(con, "cars", mtcars)
dbExecute(con, "USE db1;")
dbWriteTable(con, "cars", mtcars)

You can then verify that both databases have a copy of cars by opening them individually. Unfortunately I don't think we have a relational way yet to change databases.
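
For example, one quick way to check (reusing the hypothetical file names from the snippet above):

con2 <- dbConnect(duckdb(), dbdir = "db2.db")
dbListTables(con2)   # should include "cars"
dbDisconnect(con2, shutdown = TRUE)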

Potentially also something I can add to the duckdb-r client

@krlmlr (Collaborator) commented Dec 4, 2023

Most importantly, this would save the roundtrip through R memory. A rel_to_table() would be very much appreciated.

For writing to other databases, I suspect we can always attach them to our main database and write to a schema? Would that work?
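
A sketch of that idea in plain SQL over the existing connection (file and table names are illustrative); attaching the target file and writing through a fully qualified name avoids switching the default database:

dbExecute(con, "ATTACH 'other.db' AS other")
# Write the result of a query into a table inside the attached database
dbExecute(con, "CREATE TABLE other.main.cars AS SELECT * FROM cars")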

@Tmonster (Contributor) commented Dec 5, 2023

Most importantly, this would save the roundtrip through R memory. A rel_to_table() would be very much appreciated.

Ah, I agree that would be helpful. I can look into this.

For writing to other databases, I suspect we can always attach them to our main database and write to a schema? Would that work?

Yes, this will work. I can probably write a rel_to_table() function that writes a relation to a table qualified by a user-provided string, which is either table, schema.table, or db.schema.table.
