[R] Cannot read datasets partitioned by columns starting with dots #32061

Closed
asfimport opened this issue Jun 2, 2022 · 1 comment

asfimport commented Jun 2, 2022

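open_dataset() finds no files when a dataset is partitioned by a column whose name starts with a dot.
This might be because files and directories whose names start with a dot are treated as hidden.
There is no problem when the dot appears elsewhere in the column name.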

Reprex:

library(dplyr)
library(arrow)

packageVersion("arrow")
#> [1] '8.0.0'

path_arrow_tmp <- tempfile()

mtcars %>% 
   dplyr::group_by(cyl) %>% 
   arrow::write_dataset(
      path = path_arrow_tmp
   )

base::list.files(path_arrow_tmp, recursive = TRUE, all.files = TRUE)
#> [1] "cyl=4/part-0.parquet" "cyl=6/part-0.parquet" "cyl=8/part-0.parquet"

mtcars_load <- path_arrow_tmp %>% 
   arrow::open_dataset() %>% 
   dplyr::collect()

setequal(mtcars$mpg, mtcars_load$mpg)
#> [1] TRUE

# Change grouping by ".cyl"

path_arrow_tmp_grp <- tempfile()

mtcars %>% 
   dplyr::mutate(.cyl = cyl) %>% 
   dplyr::group_by(.cyl) %>% 
   arrow::write_dataset(
      path = path_arrow_tmp_grp
   )

# the files are there
base::list.files(path_arrow_tmp_grp, recursive = TRUE, all.files = TRUE)
#> [1] ".cyl=4/part-0.parquet" ".cyl=6/part-0.parquet" ".cyl=8/part-0.parquet"

# 0 files detected
path_arrow_tmp_grp %>% 
   arrow::open_dataset()
#> FileSystemDataset with 0 Parquet files

# Specify partitioning manually
# still no files

path_arrow_tmp_grp %>% 
   arrow::open_dataset(
      partitioning = ".cyl",
      hive_style = TRUE
   )
#> FileSystemDataset with 0 Parquet files
#> .cyl: int32

Environment:

#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.1.1 (2021-08-10)
#> os Windows 10 x64 (build 19044)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_Switzerland.1252
#> ctype C
#> tz Europe/Berlin
#> date 2022-06-02
#>
#> - Packages -------------------------------------------------------------------
#> package * version date (UTC) lib source
#> backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.2)
#> cli 3.2.0 2022-02-14 [1] CRAN (R 4.1.3)
#> crayon 1.5.0 2022-02-14 [1] CRAN (R 4.1.1)
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.2)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.1)
#> fansi 1.0.2 2022-01-14 [1] CRAN (R 4.1.2)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.1)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.2)
#> glue 1.6.1 2022-01-22 [1] CRAN (R 4.1.2)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.1.1)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1)
#> knitr 1.37 2021-12-16 [1] CRAN (R 4.1.2)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1)
#> magrittr 2.0.2 2022-01-26 [1] CRAN (R 4.1.2)
#> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.1)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.1.1)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.1.1)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.1.1)
#> R.utils 2.11.0 2021-09-26 [1] CRAN (R 4.1.1)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.1)
#> rlang 1.0.2 2022-03-04 [1] CRAN (R 4.1.3)
#> rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.0)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.2)
#> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.1)
#> styler 1.6.2 2021-09-23 [1] CRAN (R 4.1.1)
#> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.2)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.2)
#> vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.1.1)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.3)
#> xfun 0.29 2021-12-14 [1] CRAN (R 4.1.2)
#> yaml 2.2.2 2022-01-25 [1] CRAN (R 4.1.2)
Reporter: Lorenzo Gaborini

Note: This issue was originally created as ARROW-16720. Please see the migration documentation for further details.


Neal Richardson / @nealrichardson:
By default, the dataset file discovery ignores files and directories that start with . or _. A recent, not yet released change (ARROW-15280) enables you to override this by providing factory_options (example here). Could you try installing a nightly build of the package and see if you can read your dataset by providing that option?
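
A minimal sketch of what that could look like, assuming an installed arrow version that includes the ARROW-15280 change. The factory_options argument and its selector_ignore_prefixes entry are used here as I understand the API; this has not been checked against the nightly build mentioned above. Passing only "_" keeps underscore-prefixed files ignored while letting dot-prefixed partition directories through.

```r
library(arrow)
library(dplyr)

# Write a dataset partitioned by a dot-prefixed column, as in the reprex
path_dot <- tempfile()
mtcars %>%
  mutate(.cyl = cyl) %>%
  group_by(.cyl) %>%
  write_dataset(path = path_dot)

# Assumption: factory_options / selector_ignore_prefixes are available in the
# installed arrow version. The default ignored prefixes are "." and "_";
# here only "_" is kept so that directories such as ".cyl=4" are not skipped.
ds <- open_dataset(
  path_dot,
  factory_options = list(selector_ignore_prefixes = "_")
)

# Materialise the dataset and check how many rows came back
mtcars_dot <- collect(ds)
nrow(mtcars_dot)
```

If the option works as described, the row count should match nrow(mtcars), and the printed dataset should no longer report 0 Parquet files.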
