[R][Docs] Improve documentation of col_types
#38903
Comments
@assignUser Is this issue resolved? If not, I want to contribute to it!
Nope, and afaik no one is working on it, so feel free to take it on!
@assignUser First, we will go to the documentation and update the doc strings for `col_types`. Next, we will add a direct link in the `col_types`
@assignUser Am I thinking in the right direction and are you satisfied with my answer?
Could you please assign this task to me? I want to contribute to it!
I have assigned the issue to you. You can also comment "/take" on an issue and a bot will assign it to you :)
I would add that the current documentation says that a "compact string representation" of column types is allowable. This is very similar to the wording of {readr}, so without additional explanation I assumed that's what it meant, but this does not seem to work:
library(readr)
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
# works
read_csv(readr_example("mtcars.csv"), col_types = paste(rep("c", 11), collapse = ""))
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
#> 2 21 6 160 110 3.9 2.875 17.02 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.22 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.84 0 0 3 4
#> 8 24.4 4 146.7 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 140.8 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 167.6 123 3.92 3.44 18.3 1 0 4 4
#> # ℹ 22 more rows
# works
open_csv_dataset(readr_example("mtcars.csv"))
#> FileSystemDataset with 1 csv file
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64
# doesn't work
open_csv_dataset(readr_example("mtcars.csv"), col_types = paste(rep("c", 11), collapse = ""))
#> Error:
#> ! Unsupported `col_types` specification.
#> ℹ `col_types` must be NULL, or a <Schema>.
#> Backtrace:
#> ▆
#> 1. └─arrow (local) `<fn>`(...)
#> 2. └─arrow::open_dataset(...)
#> 3. └─DatasetFactory$create(...)
#> 4. └─FileFormat$create(...)
#> 5. └─CsvFileFormat$create(...)
#> 6. └─arrow:::check_csv_file_format_args(dots, partitioning = partitioning)
#> 7. ├─base::do.call(csv_file_format_convert_opts, args)
#> 8. └─arrow (local) `<fn>`(...)
#> 9. ├─base::do.call(csv_convert_options, opts)
#> 10. └─arrow (local) `<fn>`(...)
#> 11. └─rlang::abort(c("Unsupported `col_types` specification.", i = "`col_types` must be NULL, or a <Schema>."))
Created on 2024-01-24 with reprex v2.0.2
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.2 (2023-10-31)
#> os macOS Sonoma 14.1.2
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype UTF-8
#> tz Asia/Tokyo
#> date 2024-01-24
#> pandoc 3.1.2 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> arrow * 14.0.0.2 2023-12-02 [1] CRAN (R 4.3.1)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.3.0)
#> bit 4.0.5 2022-11-15 [1] CRAN (R 4.3.0)
#> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.3.0)
#> cli 3.6.2 2023-12-11 [1] CRAN (R 4.3.1)
#> crayon 1.5.2 2022-09-29 [1] CRAN (R 4.3.0)
#> digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)
#> evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.1)
#> fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.1)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.0)
#> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.3.0)
#> hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.0)
#> htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.1)
#> knitr 1.45 2023-10-30 [1] CRAN (R 4.3.1)
#> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.1)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.0)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.0)
#> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.3.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.0)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.3.0)
#> R.utils 2.12.3 2023-11-18 [1] CRAN (R 4.3.1)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.0)
#> readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.0)
#> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.3.0)
#> rlang 1.1.2 2023-11-04 [1] CRAN (R 4.3.1)
#> rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.1)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
#> styler 1.10.2 2023-08-29 [1] CRAN (R 4.3.0)
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.3.0)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.3.0)
#> tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.0)
#> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.1)
#> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.1)
#> vroom 1.6.5 2023-12-05 [1] CRAN (R 4.3.1)
#> withr 2.5.2 2023-10-30 [1] CRAN (R 4.3.1)
#> xfun 0.41 2023-11-01 [1] CRAN (R 4.3.1)
#> yaml 2.3.8 2023-12-11 [1] CRAN (R 4.3.1)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
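Since the error message above says `col_types` must be NULL or a <Schema>, one possible workaround is to build the Schema by hand. The following is only a sketch under assumptions: the temporary file, the `col_names` and `all_string` variables, and the choice to read the header with `read.csv()` are illustrative and not taken from the arrow docs.

```r
library(arrow)

# Self-contained input: write the built-in mtcars data to a temporary CSV.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Read only the first data row to recover the column names from the header.
col_names <- names(read.csv(csv_path, nrows = 1))

# Build a Schema that maps every column to string, roughly what the
# readr compact specification "ccccccccccc" expresses.
all_string <- do.call(
  schema,
  setNames(rep(list(string()), length(col_names)), col_names)
)

# Pass the Schema instead of a compact string.
ds <- open_csv_dataset(csv_path, col_types = all_string)
ds$schema
```

This keeps the readr-style intent (all columns as character) while using the only non-NULL input the error message lists as supported.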
Describe the enhancement requested
In a recent SO question about using partial schemas in `open_dataset` (which is possible using `col_types`), even a seasoned arrow user did not know about the proper solution.
The docs for open_dataset hide a lot of more specialized options behind a `...`, and it's not obvious how to find those, as the linked dataset factory page also doesn't show all possibilities. Some are explained in the specialized wrapper functions like https://arrow.apache.org/docs/r/reference/open_delim_dataset.html or https://arrow.apache.org/docs/r/reference/csv_convert_options.html but even there col_types is not described in a way that makes it obvious that it is to be used to pass in partial schemas.
At the minimum, the doc strings for `col_types` should make the intended use case clear; ideally we should link to the detailed descriptions from `open_dataset` or find another way to document the possible options more visibly.
Component(s)
Documentation, R
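For readers landing here, a minimal sketch of the partial-schema use of `col_types` described above may help. The file contents and the column names `id` and `value` are made up for illustration; the example assumes that columns not named in the Schema are left to type inference.

```r
library(arrow)

# Hypothetical two-column CSV; `id` has leading zeros that should be preserved.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,value", "001,1.5", "002,2.5"), csv_path)

# Partial schema: only `id` is pinned to string; `value` is not mentioned
# and is left for arrow to infer.
ds <- open_dataset(
  csv_path,
  format = "csv",
  col_types = schema(id = string())
)
ds$schema
```

Here `col_types` travels through the `...` of `open_dataset`, which is exactly the path the issue says is hard to discover from the current docs.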