[R][Docs] Improve documentation of col_types #38903

Open
assignUser opened this issue Nov 27, 2023 · 7 comments

@assignUser
Member

assignUser commented Nov 27, 2023

Describe the enhancement requested

In a recent SO question about using partial schemas in open_dataset (which is possible using col_types), even a seasoned arrow user did not know about the proper solution.

The docs for open_dataset hide a lot of the more specialized options behind a ..., and it's not obvious how to find them, as the linked dataset factory page also doesn't show all the possibilities. Some are explained in the specialized wrapper functions like https://arrow.apache.org/docs/r/reference/open_delim_dataset.html or https://arrow.apache.org/docs/r/reference/csv_convert_options.html, but even there col_types is not described in a way that makes it obvious that it can be used to pass in partial schemas.

At a minimum, the doc strings for col_types should make the intended use case clear; ideally, we should link to the detailed descriptions from open_dataset or find another way to document the possible options more visibly.
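
As a rough illustration of the use case described above (a sketch only; the temporary file and the column names `id` and `value` are made up for this example), passing a Schema that covers just one column via col_types should pin that column's type and leave the rest to inference:

```r
library(arrow)

# Write a tiny CSV to a temporary file so the example is self-contained.
# The file contents and column names are hypothetical.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,value", "001,1.5", "002,2.5"), csv_path)

# Partial schema: only `id` is given an explicit type.
# `value` is not mentioned, so its type is still inferred from the data.
ds <- open_dataset(
  csv_path,
  format = "csv",
  col_types = schema(id = string())
)

# Inspect the resulting schema of the dataset.
ds$schema
```

Without col_types, a column like `id` would be subject to type inference, so this is the kind of case where a partial schema is useful.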

Component(s)

Documentation, R

@ShaiviAgarwal2

@assignUser Is this issue resolved? If not, I want to contribute to it!!

@assignUser
Member Author

Nope, and afaik no one is working on it, so feel free to take it on!

@kou changed the title from "[R] [Docs] Improve documentation of col_types" to "[R][Docs] Improve documentation of col_types" on Jan 9, 2024
@ShaiviAgarwal2

@assignUser
To address the unclear documentation around using partial schemas in open_dataset via col_types, we can take a few steps to make things clearer for users.

First, we will update the doc strings for col_types so that they clearly explain that col_types can be used to pass a partial schema to open_dataset.

Next, we will add a direct link in the open_dataset documentation that leads to the detailed descriptions of the possible options, including col_types.
Alternatively, we could find another way to make these options more visible in the documentation, for example by creating a separate section or even a dedicated page for these specialized options.

@ShaiviAgarwal2

@assignUser Am I thinking in the right direction and are you satisfied with my answer?

@ShaiviAgarwal2

Could you please assign this task to me? I want to contribute to it!

@assignUser
Member Author

assignUser commented Jan 14, 2024

> Am I thinking in the right direction and are you satisfied with my answer?

I have assigned the issue to you. You can also comment "/take" on an issue and a bot will assign it to you :)

@joelnitta

I would add that the current documentation says that a "compact string representation" of column types is allowable. This is very similar to the wording of {readr}, so without additional explanation I assumed that's what it meant, but this does not seem to work:

library(readr)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

# works
read_csv(readr_example("mtcars.csv"), col_types = paste(rep("c", 11), collapse = ""))
#> # A tibble: 32 × 11
#>    mpg   cyl   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 21    6     160   110   3.9   2.62  16.46 0     1     4     4    
#>  2 21    6     160   110   3.9   2.875 17.02 0     1     4     4    
#>  3 22.8  4     108   93    3.85  2.32  18.61 1     1     4     1    
#>  4 21.4  6     258   110   3.08  3.215 19.44 1     0     3     1    
#>  5 18.7  8     360   175   3.15  3.44  17.02 0     0     3     2    
#>  6 18.1  6     225   105   2.76  3.46  20.22 1     0     3     1    
#>  7 14.3  8     360   245   3.21  3.57  15.84 0     0     3     4    
#>  8 24.4  4     146.7 62    3.69  3.19  20    1     0     4     2    
#>  9 22.8  4     140.8 95    3.92  3.15  22.9  1     0     4     2    
#> 10 19.2  6     167.6 123   3.92  3.44  18.3  1     0     4     4    
#> # ℹ 22 more rows

# works
open_csv_dataset(readr_example("mtcars.csv"))
#> FileSystemDataset with 1 csv file
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64

# doesn't work
open_csv_dataset(readr_example("mtcars.csv"), col_types = paste(rep("c", 11), collapse = ""))
#> Error:
#> ! Unsupported `col_types` specification.
#> ℹ `col_types` must be NULL, or a <Schema>.
#> Backtrace:
#>      ▆
#>   1. └─arrow (local) `<fn>`(...)
#>   2.   └─arrow::open_dataset(...)
#>   3.     └─DatasetFactory$create(...)
#>   4.       └─FileFormat$create(...)
#>   5.         └─CsvFileFormat$create(...)
#>   6.           └─arrow:::check_csv_file_format_args(dots, partitioning = partitioning)
#>   7.             ├─base::do.call(csv_file_format_convert_opts, args)
#>   8.             └─arrow (local) `<fn>`(...)
#>   9.               ├─base::do.call(csv_convert_options, opts)
#>  10.               └─arrow (local) `<fn>`(...)
#>  11.                 └─rlang::abort(c("Unsupported `col_types` specification.", i = "`col_types` must be NULL, or a <Schema>."))

Created on 2024-01-24 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31)
#>  os       macOS Sonoma 14.1.2
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    UTF-8
#>  tz       Asia/Tokyo
#>  date     2024-01-24
#>  pandoc   3.1.2 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version  date (UTC) lib source
#>  arrow       * 14.0.0.2 2023-12-02 [1] CRAN (R 4.3.1)
#>  assertthat    0.2.1    2019-03-21 [1] CRAN (R 4.3.0)
#>  bit           4.0.5    2022-11-15 [1] CRAN (R 4.3.0)
#>  bit64         4.0.5    2020-08-30 [1] CRAN (R 4.3.0)
#>  cli           3.6.2    2023-12-11 [1] CRAN (R 4.3.1)
#>  crayon        1.5.2    2022-09-29 [1] CRAN (R 4.3.0)
#>  digest        0.6.33   2023-07-07 [1] CRAN (R 4.3.0)
#>  evaluate      0.23     2023-11-01 [1] CRAN (R 4.3.1)
#>  fansi         1.0.6    2023-12-08 [1] CRAN (R 4.3.1)
#>  fastmap       1.1.1    2023-02-24 [1] CRAN (R 4.3.0)
#>  fs            1.6.3    2023-07-20 [1] CRAN (R 4.3.0)
#>  glue          1.6.2    2022-02-24 [1] CRAN (R 4.3.0)
#>  hms           1.1.3    2023-03-21 [1] CRAN (R 4.3.0)
#>  htmltools     0.5.7    2023-11-03 [1] CRAN (R 4.3.1)
#>  knitr         1.45     2023-10-30 [1] CRAN (R 4.3.1)
#>  lifecycle     1.0.4    2023-11-07 [1] CRAN (R 4.3.1)
#>  magrittr      2.0.3    2022-03-30 [1] CRAN (R 4.3.0)
#>  pillar        1.9.0    2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 4.3.0)
#>  purrr         1.0.2    2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache       0.16.0   2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3   1.8.2    2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo          1.25.0   2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils       2.12.3   2023-11-18 [1] CRAN (R 4.3.1)
#>  R6            2.5.1    2021-08-19 [1] CRAN (R 4.3.0)
#>  readr       * 2.1.4    2023-02-10 [1] CRAN (R 4.3.0)
#>  reprex        2.0.2    2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang         1.1.2    2023-11-04 [1] CRAN (R 4.3.1)
#>  rmarkdown     2.25     2023-09-18 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2    2021-12-06 [1] CRAN (R 4.3.0)
#>  styler        1.10.2   2023-08-29 [1] CRAN (R 4.3.0)
#>  tibble        3.2.1    2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyselect    1.2.0    2022-10-10 [1] CRAN (R 4.3.0)
#>  tzdb          0.4.0    2023-05-12 [1] CRAN (R 4.3.0)
#>  utf8          1.2.4    2023-10-22 [1] CRAN (R 4.3.1)
#>  vctrs         0.6.5    2023-12-01 [1] CRAN (R 4.3.1)
#>  vroom         1.6.5    2023-12-05 [1] CRAN (R 4.3.1)
#>  withr         2.5.2    2023-10-30 [1] CRAN (R 4.3.1)
#>  xfun          0.41     2023-11-01 [1] CRAN (R 4.3.1)
#>  yaml          2.3.8    2023-12-11 [1] CRAN (R 4.3.1)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
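
For comparison with the failing call in the reprex above, here is a minimal sketch of what the error message points to: building a Schema that marks every column as string and passing it as col_types. It assumes a temporary CSV written from the built-in mtcars data rather than the readr example file, and the helper variable names are made up for illustration:

```r
library(arrow)

# Write the built-in mtcars data to a temporary CSV (stand-in for
# readr_example("mtcars.csv") so the example does not depend on readr).
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE, quote = FALSE)

# Build a Schema with one string field per column, the Schema-based
# counterpart of readr's compact "ccccccccccc" specification.
col_names <- names(mtcars)
string_types <- setNames(rep(list(string()), length(col_names)), col_names)
all_string <- do.call(schema, string_types)

# Pass the Schema as col_types instead of a compact string.
ds <- open_csv_dataset(csv_path, col_types = all_string)
ds$schema
```

This only shows the Schema route named in the error message; it does not say whether the compact string form is meant to be supported for datasets.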
