[R][Docs] Improve documentation of `col_types` #38903

assignUser · 2023-11-27T22:15:00Z

Describe the enhancement requested

In a recent SO question about using partial schemas in open_dataset (which is possible using col_types), even a seasoned arrow user did not know about the proper solution.

The docs for open_dataset hide a lot of more specialized options behind a ... and it's not obvious how to find those, as the linked dataset factory page also doesn't show all possibilities. Some are explained in the specialized wrapper functions like https://arrow.apache.org/docs/r/reference/open_delim_dataset.html or https://arrow.apache.org/docs/r/reference/csv_convert_options.html but even there col_types is not described in a way that makes it obvious that it is to be used to pass in partial schemas.

At a minimum, the doc strings for col_types should make the intended use case clear. Ideally we should link to the detailed descriptions from open_dataset or find another way to document the possible options more visibly.
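As an illustration of the use case described above, here is a minimal sketch of passing a partial schema through col_types. The file contents and column names are made up for the example, and it assumes an arrow version comparable to the one used elsewhere in this thread, so treat it as a sketch rather than documented behaviour.

```r
library(arrow)

# Hypothetical example data: a small CSV written to a temporary file
tf <- tempfile(fileext = ".csv")
writeLines(c("id,x,n", "001,1.5,1", "002,2.5,2"), tf)

# Partial schema: only `id` is specified as a string;
# the remaining columns (`x`, `n`) are left to type inference
ds <- open_csv_dataset(tf, col_types = schema(id = string()))

# Inspect the resulting schema
ds$schema
```

The point of the partial schema is that only the columns whose inferred type would be wrong (here, an ID with leading zeros) need to be listed.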

Component(s)

Documentation, R

ShaiviAgarwal2 · 2024-01-08T15:07:24Z

@assignUser Is this issue resolved? If not, I want to contribute to it!!

assignUser · 2024-01-09T00:33:23Z

Nope, and afaik no one is working on it, so feel free to take it on!

ShaiviAgarwal2 · 2024-01-09T03:32:16Z

@assignUser
To address the unclear documentation for partial schemas in the open_dataset function using col_types, we can take a few steps to make things clearer for users.

First, we will update the doc strings for col_types so that they clearly explain that col_types is used for passing partial schemas in open_dataset.

Next, we will add a direct link in the open_dataset documentation that leads to the detailed descriptions of the possible options, including col_types.
Alternatively, we could find another way to make these options more visible in the documentation, for example by creating a separate section or even a dedicated page for these specialized options.

ShaiviAgarwal2 · 2024-01-13T10:05:08Z

@assignUser Am I thinking in the right direction and are you satisfied with my answer?

ShaiviAgarwal2 · 2024-01-13T10:05:40Z

Could you please assign this task to me, I want to contribute to it!!

assignUser · 2024-01-14T18:07:24Z

> Am I thinking in the right direction and are you satisfied with my answer?

I have assigned the issue to you. You can also comment "/take" on an issue and a bot will assign it to you :)

joelnitta · 2024-01-24T06:05:37Z

I would add that the current documentation says that a "compact string representation" of column types is allowable. This is very similar to the wording of {readr}, so without additional explanation I assumed that's what it meant, but this does not seem to work:

library(readr)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

# works
read_csv(readr_example("mtcars.csv"), col_types = paste(rep("c", 11), collapse = ""))
#> # A tibble: 32 × 11
#>    mpg   cyl   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#>    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 21    6     160   110   3.9   2.62  16.46 0     1     4     4    
#>  2 21    6     160   110   3.9   2.875 17.02 0     1     4     4    
#>  3 22.8  4     108   93    3.85  2.32  18.61 1     1     4     1    
#>  4 21.4  6     258   110   3.08  3.215 19.44 1     0     3     1    
#>  5 18.7  8     360   175   3.15  3.44  17.02 0     0     3     2    
#>  6 18.1  6     225   105   2.76  3.46  20.22 1     0     3     1    
#>  7 14.3  8     360   245   3.21  3.57  15.84 0     0     3     4    
#>  8 24.4  4     146.7 62    3.69  3.19  20    1     0     4     2    
#>  9 22.8  4     140.8 95    3.92  3.15  22.9  1     0     4     2    
#> 10 19.2  6     167.6 123   3.92  3.44  18.3  1     0     4     4    
#> # ℹ 22 more rows

# works
open_csv_dataset(readr_example("mtcars.csv"))
#> FileSystemDataset with 1 csv file
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64

# doesn't work
open_csv_dataset(readr_example("mtcars.csv"), col_types = paste(rep("c", 11), collapse = ""))
#> Error:
#> ! Unsupported `col_types` specification.
#> ℹ `col_types` must be NULL, or a <Schema>.
#> Backtrace:
#>      ▆
#>   1. └─arrow (local) `<fn>`(...)
#>   2.   └─arrow::open_dataset(...)
#>   3.     └─DatasetFactory$create(...)
#>   4.       └─FileFormat$create(...)
#>   5.         └─CsvFileFormat$create(...)
#>   6.           └─arrow:::check_csv_file_format_args(dots, partitioning = partitioning)
#>   7.             ├─base::do.call(csv_file_format_convert_opts, args)
#>   8.             └─arrow (local) `<fn>`(...)
#>   9.               ├─base::do.call(csv_convert_options, opts)
#>  10.               └─arrow (local) `<fn>`(...)
#>  11.                 └─rlang::abort(c("Unsupported `col_types` specification.", i = "`col_types` must be NULL, or a <Schema>."))

Created on 2024-01-24 with reprex v2.0.2
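For comparison, here is a sketch of a Schema-based way to read every column as a string, which is what the compact string above was meant to express. It uses a made-up temporary CSV rather than mtcars.csv so that it is self-contained, and it takes the column names from a first, inferred pass over the file. This is one possible approach under those assumptions, not necessarily the recommended one.

```r
library(arrow)

# Hypothetical example data standing in for mtcars.csv
tf <- tempfile(fileext = ".csv")
writeLines(c("mpg,cyl,disp", "21,6,160", "22.8,4,108"), tf)

# First pass: let arrow infer the schema, only to obtain the column names
col_names <- open_csv_dataset(tf)$schema$names

# Build a named list of string types, one entry per column,
# and turn it into a Schema (the type col_types accepts per the error above)
types <- setNames(rep(list(string()), length(col_names)), col_names)
all_string <- do.call(schema, types)

# Second pass: open the dataset with every column typed as string
ds_chr <- open_csv_dataset(tf, col_types = all_string)
ds_chr$schema
```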

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31)
#>  os       macOS Sonoma 14.1.2
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    UTF-8
#>  tz       Asia/Tokyo
#>  date     2024-01-24
#>  pandoc   3.1.2 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version  date (UTC) lib source
#>  arrow       * 14.0.0.2 2023-12-02 [1] CRAN (R 4.3.1)
#>  assertthat    0.2.1    2019-03-21 [1] CRAN (R 4.3.0)
#>  bit           4.0.5    2022-11-15 [1] CRAN (R 4.3.0)
#>  bit64         4.0.5    2020-08-30 [1] CRAN (R 4.3.0)
#>  cli           3.6.2    2023-12-11 [1] CRAN (R 4.3.1)
#>  crayon        1.5.2    2022-09-29 [1] CRAN (R 4.3.0)
#>  digest        0.6.33   2023-07-07 [1] CRAN (R 4.3.0)
#>  evaluate      0.23     2023-11-01 [1] CRAN (R 4.3.1)
#>  fansi         1.0.6    2023-12-08 [1] CRAN (R 4.3.1)
#>  fastmap       1.1.1    2023-02-24 [1] CRAN (R 4.3.0)
#>  fs            1.6.3    2023-07-20 [1] CRAN (R 4.3.0)
#>  glue          1.6.2    2022-02-24 [1] CRAN (R 4.3.0)
#>  hms           1.1.3    2023-03-21 [1] CRAN (R 4.3.0)
#>  htmltools     0.5.7    2023-11-03 [1] CRAN (R 4.3.1)
#>  knitr         1.45     2023-10-30 [1] CRAN (R 4.3.1)
#>  lifecycle     1.0.4    2023-11-07 [1] CRAN (R 4.3.1)
#>  magrittr      2.0.3    2022-03-30 [1] CRAN (R 4.3.0)
#>  pillar        1.9.0    2023-03-22 [1] CRAN (R 4.3.0)
#>  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 4.3.0)
#>  purrr         1.0.2    2023-08-10 [1] CRAN (R 4.3.0)
#>  R.cache       0.16.0   2022-07-21 [1] CRAN (R 4.3.0)
#>  R.methodsS3   1.8.2    2022-06-13 [1] CRAN (R 4.3.0)
#>  R.oo          1.25.0   2022-06-12 [1] CRAN (R 4.3.0)
#>  R.utils       2.12.3   2023-11-18 [1] CRAN (R 4.3.1)
#>  R6            2.5.1    2021-08-19 [1] CRAN (R 4.3.0)
#>  readr       * 2.1.4    2023-02-10 [1] CRAN (R 4.3.0)
#>  reprex        2.0.2    2022-08-17 [1] CRAN (R 4.3.0)
#>  rlang         1.1.2    2023-11-04 [1] CRAN (R 4.3.1)
#>  rmarkdown     2.25     2023-09-18 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2    2021-12-06 [1] CRAN (R 4.3.0)
#>  styler        1.10.2   2023-08-29 [1] CRAN (R 4.3.0)
#>  tibble        3.2.1    2023-03-20 [1] CRAN (R 4.3.0)
#>  tidyselect    1.2.0    2022-10-10 [1] CRAN (R 4.3.0)
#>  tzdb          0.4.0    2023-05-12 [1] CRAN (R 4.3.0)
#>  utf8          1.2.4    2023-10-22 [1] CRAN (R 4.3.1)
#>  vctrs         0.6.5    2023-12-01 [1] CRAN (R 4.3.1)
#>  vroom         1.6.5    2023-12-05 [1] CRAN (R 4.3.1)
#>  withr         2.5.2    2023-10-30 [1] CRAN (R 4.3.1)
#>  xfun          0.41     2023-11-01 [1] CRAN (R 4.3.1)
#>  yaml          2.3.8    2023-12-11 [1] CRAN (R 4.3.1)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

assignUser added the Type: enhancement label Nov 27, 2023

github-actions bot added Component: R Component: Documentation labels Nov 27, 2023

assignUser added good-first-issue and removed Component: R Component: Documentation labels Nov 27, 2023

github-actions bot added Component: R Component: Documentation labels Nov 27, 2023

kou changed the title ~~[R] [Docs] Improve documentation of col_types~~ [R][Docs] Improve documentation of col_types Jan 9, 2024

assignUser assigned ShaiviAgarwal2 Jan 14, 2024

joelnitta mentioned this issue Jan 26, 2024

[R] col_types of open_delim_dataset() does not work as described #39811

Open
