[R] Read CSV with comma as decimal mark #29184

Closed
Tracked by #33370
asfimport opened this issue Aug 2, 2021 · 2 comments · Fixed by #38002

@asfimport
Collaborator

Follow-up to ARROW-13421. There is a new ConvertOption, so that part is easy. Emulating the readr way of supporting this may involve some subtleties, since readr uses a broader locale() object, but maybe we just add read_csv2_arrow (matching readr::read_csv2 and base::read.csv2) and that's enough.

Reporter: Neal Richardson / @nealrichardson

Note: This issue was originally created as ARROW-13531. Please see the migration documentation for further details.

@asfimport
Collaborator Author

Dewey Dunnington / @paleolimbot:
Reprex:

library(arrow, warn.conflicts = FALSE)

tf <- tempfile()
write("col1;col2\n1,23;val1\n4,56;val2\n", tf)

# how it's done elsewhere
read.csv2(tf)
#>   col1 col2
#> 1 1.23 val1
#> 2 4.56 val2
readr::read_csv2(tf, show_col_types = FALSE)
#> ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
#> # A tibble: 2 × 2
#>    col1 col2 
#>   <dbl> <chr>
#> 1  1.23 val1 
#> 2  4.56 val2
readr::read_delim(
  tf,
  delim = ";",
  locale = readr::locale(decimal_mark = ","),
  show_col_types = FALSE
)
#> # A tibble: 2 × 2
#>    col1 col2 
#>   <dbl> <chr>
#> 1  1.23 val1 
#> 2  4.56 val2

# possible syntax in arrow::read_csv_arrow()
read_csv_arrow(
  tf,
  parse_options = CsvParseOptions$create(delimiter = ";"),
  convert_options = CsvConvertOptions$create(decimal_point = ",")
)
#> Error in CsvConvertOptions$create(decimal_point = ","): unused argument (decimal_point = ",")

read_csv2_arrow(tf)
#> Error in read_csv2_arrow(tf): could not find function "read_csv2_arrow"

Where the CsvConvertOptions are defined:

arrow/r/R/csv.R

Lines 526 to 559 in 670af33

CsvConvertOptions$create <- function(check_utf8 = TRUE,
                                     null_values = c("", "NA"),
                                     true_values = c("T", "true", "TRUE"),
                                     false_values = c("F", "false", "FALSE"),
                                     strings_can_be_null = FALSE,
                                     col_types = NULL,
                                     auto_dict_encode = FALSE,
                                     auto_dict_max_cardinality = 50L,
                                     include_columns = character(),
                                     include_missing_columns = FALSE,
                                     timestamp_parsers = NULL) {
  if (!is.null(col_types) && !inherits(col_types, "Schema")) {
    abort(c(
      "Unsupported `col_types` specification.",
      i = "`col_types` must be NULL, or a <Schema>."
    ))
  }
  csv___ConvertOptions__initialize(
    list(
      check_utf8 = check_utf8,
      null_values = null_values,
      strings_can_be_null = strings_can_be_null,
      col_types = col_types,
      true_values = true_values,
      false_values = false_values,
      auto_dict_encode = auto_dict_encode,
      auto_dict_max_cardinality = auto_dict_max_cardinality,
      include_columns = include_columns,
      include_missing_columns = include_missing_columns,
      timestamp_parsers = timestamp_parsers
    )
  )
}

arrow/r/src/csv.cpp

Lines 79 to 149 in 670af33

std::shared_ptr<arrow::csv::ConvertOptions> csv___ConvertOptions__initialize(
    cpp11::list options) {
  auto res = std::make_shared<arrow::csv::ConvertOptions>(
      arrow::csv::ConvertOptions::Defaults());
  res->check_utf8 = cpp11::as_cpp<bool>(options["check_utf8"]);
  // Recognized spellings for null values
  res->null_values = cpp11::as_cpp<std::vector<std::string>>(options["null_values"]);
  // Whether string / binary columns can have null values.
  // If true, then strings in "null_values" are considered null for string columns.
  // If false, then all strings are valid string values.
  res->strings_can_be_null = cpp11::as_cpp<bool>(options["strings_can_be_null"]);
  res->true_values = cpp11::as_cpp<std::vector<std::string>>(options["true_values"]);
  res->false_values = cpp11::as_cpp<std::vector<std::string>>(options["false_values"]);
  SEXP col_types = options["col_types"];
  if (Rf_inherits(col_types, "Schema")) {
    auto schema = cpp11::as_cpp<std::shared_ptr<arrow::Schema>>(col_types);
    std::unordered_map<std::string, std::shared_ptr<arrow::DataType>> column_types;
    for (const auto& field : schema->fields()) {
      column_types.insert(std::make_pair(field->name(), field->type()));
    }
    res->column_types = column_types;
  }
  res->auto_dict_encode = cpp11::as_cpp<bool>(options["auto_dict_encode"]);
  res->auto_dict_max_cardinality =
      cpp11::as_cpp<int>(options["auto_dict_max_cardinality"]);
  res->include_columns =
      cpp11::as_cpp<std::vector<std::string>>(options["include_columns"]);
  res->include_missing_columns = cpp11::as_cpp<bool>(options["include_missing_columns"]);
  SEXP op_timestamp_parsers = options["timestamp_parsers"];
  if (!Rf_isNull(op_timestamp_parsers)) {
    std::vector<std::shared_ptr<arrow::TimestampParser>> timestamp_parsers;
    // if we have a character vector, convert to arrow::StrptimeTimestampParser
    if (TYPEOF(op_timestamp_parsers) == STRSXP) {
      cpp11::strings s_timestamp_parsers(op_timestamp_parsers);
      for (cpp11::r_string s : s_timestamp_parsers) {
        timestamp_parsers.push_back(arrow::TimestampParser::MakeStrptime(s));
      }
    } else if (TYPEOF(op_timestamp_parsers) == VECSXP) {
      cpp11::list lst_parsers(op_timestamp_parsers);
      for (SEXP x : lst_parsers) {
        // handle scalar string and TimestampParser instances
        if (TYPEOF(x) == STRSXP && XLENGTH(x) == 1) {
          timestamp_parsers.push_back(
              arrow::TimestampParser::MakeStrptime(CHAR(STRING_ELT(x, 0))));
        } else if (Rf_inherits(x, "TimestampParser")) {
          timestamp_parsers.push_back(
              cpp11::as_cpp<std::shared_ptr<arrow::TimestampParser>>(x));
        } else {
          cpp11::stop(
              "unsupported timestamp parser, must be a scalar string or a "
              "<TimestampParser> object");
        }
      }
    } else {
      cpp11::stop(
          "unsupported timestamp parser, must be character vector of strptime "
          "specifications, or a list of <TimestampParser> objects");
    }
    res->timestamp_parsers = timestamp_parsers;
  }
  return res;
}

/// Decimal point character for floating-point and decimal data
char decimal_point = '.';

@thisisnic
Member

@paleolimbot Your instructions here are 🔥

thisisnic added a commit that referenced this issue Oct 9, 2023
### Rationale for this change

Allow customisable decimal points when reading data

### What changes are included in this PR?

Expose the C++ option in R

### Are these changes tested?

Aye

### Are there any user-facing changes?

Indeed
* Closes: #29184

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
@thisisnic thisisnic added this to the 14.0.0 milestone Oct 9, 2023
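
For readers arriving after the fix, here is a minimal sketch of how the two syntaxes proposed in the reprex might be used once the C++ option is exposed in R. It assumes an arrow release at or after the 14.0.0 milestone, and that both `CsvConvertOptions$create(decimal_point = )` and `read_csv2_arrow()` are available in that release; check the documentation of your installed version rather than treating this as the definitive interface.

```r
library(arrow, warn.conflicts = FALSE)

# Same semicolon-delimited, comma-decimal data as in the reprex above
tf <- tempfile(fileext = ".csv")
writeLines(c("col1;col2", "1,23;val1", "4,56;val2"), tf)

# Explicit options: the delimiter belongs to the parser,
# the decimal mark belongs to the value converter.
# Assumption: decimal_point is accepted here (arrow >= 14.0.0).
explicit <- read_csv_arrow(
  tf,
  parse_options = CsvParseOptions$create(delimiter = ";"),
  convert_options = CsvConvertOptions$create(decimal_point = ",")
)

# Convenience wrapper, analogous to readr::read_csv2() and base::read.csv2().
# Assumption: read_csv2_arrow() exists in the installed version.
shortcut <- read_csv2_arrow(tf)

unlink(tf)
```

If the installed version is older, the first call fails with the "unused argument" error and the second with the "could not find function" error shown in the reprex.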
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 23, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024