read/write dm as csv/zip(csv)/xlsx #485

Closed
wants to merge 63 commits into from

Conversation

TSchiefer
Member

  • dm_write_csv(dm, csv_directory): write dm as collection of csv-files
  • dm_read_csv(csv_directory): read dm from directory created using dm_write_csv()
  • dm_write_zip(dm, zip_file_path = "dm.zip", overwrite = FALSE): same as csv, but zipped.
  • dm_read_zip(zip_file_path)
  • dm_write_xlsx(dm, xlsx_file_path = "dm.xlsx", overwrite = FALSE)
  • dm_read_xlsx(xlsx_file_path)

I am prepared for a longer wait until the conclusion of this PR, since there might be a few things to discuss.

For example:

  • Should we already prepare for compound keys?
  • Will it work for all (important) column classes? What will be the default if it doesn't work?
  • Could there be some way to support writing a remote dm to a file/directory?

closes #276

@TSchiefer TSchiefer requested a review from krlmlr March 3, 2021 14:17
@krlmlr
Collaborator

krlmlr commented Mar 3, 2021

Nice! To the questions:

  1. Yes, absolutely.
  2. Failure, perhaps with extensibility option later.
  3. Looks very niche -- if it fits into RAM, we can collect(); if not, what's the purpose of saving as CSV?

We should also integrate with {shard}.

@TSchiefer
Member Author

We should also integrate with {shard}.

Excuse my ignorance, but is what you mean this?
If yes, maybe in another PR?

Collaborator

@krlmlr krlmlr left a comment

We now have compound keys, need to adapt.

I'd prefer using existing methods for enumerating primary and foreign keys.

csv_files <- list.files(csv_directory)
# compress the file ("-j" junks the path to the file)

zip(
Collaborator

I suspect {zip} might work better here: https://cran.r-project.org/web/packages/zip/index.html.

Member Author

It may be more platform-agnostic if it works, but the first test (which passes with utils::zip()) fails with this error:

Error in zip_internal(zipfile, files, recurse, compression_level, append = FALSE,  : 
  zip error: `Cannot add file `/var/folders/x3/ndmkxk1j2wn0mx2pw9v9httm0000gn/T//RtmpmKOSNA/dm_zip_1149248ed84b/___coltypes_file_dm.csv` to archive `___test_path/dm.zip`` in file `zip.c:348`

Not sure if it's worth the effort to try finding the source of this problem.


if (file.exists(xlsx_file_path)) {
if (overwrite) {
message(glue::glue("Overwriting file {tick(xlsx_file_path)}."))
Collaborator

How can we mute this message?

Member Author

Is it necessary to be able to mute this? i.e., how often will users recreate/overwrite files? (honestly not sure)
We could add a quiet argument
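
A minimal sketch of what such a `quiet` argument could look like. The helper name `check_overwrite()` and its signature are hypothetical and are not taken from the PR; only the idea of an `overwrite` flag plus an informational message comes from the discussion above.

```r
# Hypothetical helper (not part of the PR): decide whether `path` may be written.
check_overwrite <- function(path, overwrite = FALSE, quiet = FALSE) {
  # Nothing to overwrite, nothing to report.
  if (!file.exists(path)) {
    return(invisible(TRUE))
  }
  # Refuse to clobber an existing file unless explicitly allowed.
  if (!overwrite) {
    stop(
      "File `", path, "` already exists; set `overwrite = TRUE` to replace it.",
      call. = FALSE
    )
  }
  # Inform the user unless they opted out.
  if (!quiet) {
    message("Overwriting file `", path, "`.")
  }
  invisible(TRUE)
}
```

Because the notice is emitted with `message()`, callers can also mute it without a dedicated argument by wrapping the call in `suppressMessages()`.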

"POSIXct"
)

convert_all_times_to_utc <- function(table_list, col_class_table) {
Collaborator

Is this within the scope of the function? Will it affect the roundtrip, or only the snapshot tests, if we omit UTC conversion?

Member Author

Several problems:

  1. unfortunately it's not possible to read timezones via readr::read_csv(). Therefore, it seems much safer to convert to UTC and to inform when writing the csv files.
  2. for xlsx: actually writexl::write_xlsx() does the conversion to UTC quietly itself anyway. In this case we would not need to perform the conversion, but just inform the user.

I think it's not harmful to leave it as is, but I am open to suggestions.

Member Author

  1. unfortunately it's not possible to read timezones via readr::read_csv(). Therefore, it seems much safer to convert to UTC and to inform when writing the csv file

I was too quick to claim that:
https://readr.tidyverse.org/articles/locales.html
It is actually possible to steer the timezone with the argument locale in readr::read_csv(). Not sure if making use of this possibility improves the transparency of our functions (mainly if it doesn't work with xlsx)
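
To make the trade-off concrete, here is a small sketch under stated assumptions. The first part uses base R only and shows that switching a `POSIXct` value to UTC changes how it is displayed, not the instant it represents. The second part, which runs only if {readr} is installed, shows the `locale` argument mentioned above; the file contents and the column name `ts` are made up for illustration.

```r
# A datetime stored with a non-UTC time zone.
x <- as.POSIXct("2021-03-03 14:17:00", tz = "Europe/Zurich")

# Re-label the value as UTC; the underlying number of seconds is unchanged.
x_utc <- x
attr(x_utc, "tzone") <- "UTC"

format(x_utc, "%Y-%m-%d %H:%M:%S")   # 13:17:00 in UTC (Zurich is UTC+1 on that date)
as.numeric(x) == as.numeric(x_utc)   # TRUE: same instant

# Reading a naive timestamp back with an explicit time zone via readr's locale.
if (requireNamespace("readr", quietly = TRUE)) {
  path <- tempfile(fileext = ".csv")
  writeLines(c("ts", "2021-03-03 14:17:00"), path)
  back <- readr::read_csv(
    path,
    col_types = readr::cols(ts = readr::col_datetime()),
    locale = readr::locale(tz = "Europe/Zurich")
  )
  attr(back$ts, "tzone")
}
```

The CSV text itself carries no time zone, so the reader has to be told which one to assume; writing UTC avoids having to store that information per column.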

c("Converting the datetime values for the following column(s) to timezone `UTC`:\n",
glue::glue("{paste0(to_convert$table, '$', to_convert$column, collapse = '\n')}"))
)
table_list <- reduce2(to_convert$table, to_convert$column, function(tables, table, column) {
Collaborator

mutate(across()) ?

Member Author

Maybe, but my thought was that it's good to inform users which columns are converted. And since we then know those columns, we can also make the changes explicitly.

)

convert_all_times_to_utc <- function(table_list, col_class_table) {
if (any(col_class_table$class %in% c("POSIXlt", "POSIXct"))) {
Collaborator

It looks like this won't pick up subclasses of "POSIXlt" or "POSIXct"? Not sure if this is relevant though.

Member Author

as it is implemented, just the following types are supported: character, Date, integer, logical, numeric, POSIXct, POSIXlt.
If it turns out that there are further useful classes that should be supported, I would suggest adding them in future PRs.
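
For reference, a short sketch of the subclass concern, assuming the class table records only the first element of each column's class vector (how the PR builds `col_class_table` is not shown here). The class `my_datetime` and the helper `is_datetime()` are hypothetical.

```r
# Hypothetical subclass of POSIXct, for illustration only.
x <- structure(Sys.time(), class = c("my_datetime", "POSIXct", "POSIXt"))

# Matching on the first class element misses the subclass.
class(x)[[1]] %in% c("POSIXlt", "POSIXct")   # FALSE

# inherits() looks at the whole class vector, so subclasses are detected.
inherits(x, "POSIXt")                        # TRUE

# Applied per column of a table:
is_datetime <- function(col) inherits(col, "POSIXt")

df <- data.frame(id = 1L, label = "a")
df$when <- x
vapply(df, is_datetime, logical(1))
```

Whether subclasses should be converted at all is a separate design question; the sketch only shows how they could be detected.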

@krlmlr
Collaborator

krlmlr commented Jul 5, 2021

We also want check_suggested() from #572.

@TSchiefer
Member Author

Compound keys work now.

@TSchiefer TSchiefer requested a review from krlmlr July 20, 2021 11:08
@TSchiefer
Member Author

We also want check_suggested() from #572.

for the functions from {readr}, {readxl}, {writexl}? good idea. Even though there is some implementation of it in main, shall we wait for the merge?

@krlmlr
Collaborator

krlmlr commented Jul 25, 2021

This looks good, I'd like to play with it before merging. Is this blocking another project? If we had to choose between csv, zip and xlsx, what would be the preference?

@TSchiefer
Member Author

TSchiefer commented Jul 26, 2021

This looks good, I'd like to play with it before merging. Is this blocking another project? If we had to choose between csv, zip and xlsx, what would be the preference?

Sure, take your time. It's not blocking anything - at least not for me.
xlsx is nice, since it's just one file (as opposed to csv) and one can still easily get an overview of the contents (as opposed to zip). Obvious disadvantage is the requirement of MS Excel (unless you're not interested in looking at the file in Excel and just want to restore it later with dm_read_xlsx()).

Collaborator

@krlmlr krlmlr left a comment

I tested it locally. I think we need to adapt it to the case where a foreign key is linked to a non-primary key, also for the case of compound keys.

Let's wait for #517, it will be easier to serialize the output of dm_meta()

@krlmlr
Collaborator

krlmlr commented Aug 15, 2023

This is mostly new code, could also be moved elsewhere.

@krlmlr
Collaborator

krlmlr commented Aug 20, 2023

Closing for now, added a reference to the issue.

@krlmlr krlmlr closed this Aug 20, 2023
Successfully merging this pull request may close these issues.

Store dm as xlsx or collection of csv files (zip)
2 participants