[R] Update docs to clarify that stringsAsFactors isn't relevant for parquet/feather #24055

Closed
asfimport opened this issue Feb 10, 2020 · 7 comments

Comments


asfimport commented Feb 10, 2020

Same issue as reported for feather::read_feather (#24054).

For the R arrow package, the "read_parquet()" function currently does not respect "options(stringsAsFactors = FALSE)", leading to unexpected/inconsistent behavior.

 

Example:

library(arrow)
library(readr)
options(stringsAsFactors = FALSE)
write_tsv(head(iris), 'test.tsv')
write_parquet(head(iris), 'test.parquet')
head(read.delim('test.tsv', sep='\t')$Species)
# [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"
head(read_tsv('test.tsv', col_types = cols())$Species)
# [1] "setosa" "setosa" "setosa" "setosa" "setosa" "setosa"
head(read_parquet('test.parquet')$Species)
# [1] setosa setosa setosa setosa setosa setosa
# Levels: setosa versicolor virginica

Versions:

  • R 3.6.2

  • arrow_0.15.1.9000

Environment: Linux 64-bit 5.4.15
Reporter: Keith Hughitt / @khughitt
Assignee: Neal Richardson / @nealrichardson

Related issues:

Note: This issue was originally created as ARROW-7825. Please see the migration documentation for further details.


Neal Richardson / @nealrichardson:
I'm not sure this is valid. iris$Species is already a factor, so that type is preserved when you write to Parquet or Feather, but it is lost when you write to a CSV. So stringsAsFactors is irrelevant here: unlike in a CSV, the column is not stored as a string in the Parquet/Feather file.


Keith Hughitt / @khughitt:
@nealrichardson That's a fair argument and a good point. Preserving the actual type of iris$Species is certainly preferred.

The downside is that the other read_xx functions in the R ecosystem generally do not behave this way, so I think many users are going to be caught off guard by this (speaking from experience).

I'm not sure what the best solution is here. In principle, I agree that the current behavior is the most sensible, so perhaps it is just a matter of educating the community to be aware of these differences when working with file types that can properly encode factor variables.

Perhaps just include a note in the read_feather() and read_parquet() docs mentioning this expected difference in behavior compared with the other read functions?


Francois Saint-Jacques / @fsaintjacques:
Side note: the Arrow CSV reader has an option to parse a given column as a dictionary type.


Neal Richardson / @nealrichardson:
@khughitt Try writing iris to feather and parquet with Species as a character column and reading it back. My guess is that it will stay a character/string column and also will ignore stringsAsFactors.

Leaving aside the merits of the stringsAsFactors feature itself and its historical origins, I think it only really makes sense when you're reading a text file like a CSV and you don't have type information available. For binary formats with metadata, it makes sense to preserve what was saved. When you readRDS it doesn't convert your character vectors to factors, for example:

> options(stringsAsFactors=FALSE)
> tf <- tempfile()
> saveRDS(iris, tf)
> str(readRDS(tf))
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> iris$Species <- as.character(iris$Species)
> str(iris)
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
> saveRDS(iris, tf)
> str(readRDS(tf))
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
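
The same round trip suggested above for Feather and Parquet can be sketched as follows. This is a minimal illustration rather than anything from the arrow docs: the temporary file names and the `types` vector are made up for the example, and it assumes a reasonably recent arrow package. Based on the behavior reported in this thread, the factor column should come back as a factor and the character columns as character, regardless of the stringsAsFactors option.

```r
library(arrow)

# Same option as in the original report; the files below carry their own
# column types, so this setting is not expected to matter.
options(stringsAsFactors = FALSE)

# One copy with Species as a factor, one with Species as character.
df_factor <- head(iris)
df_char <- df_factor
df_char$Species <- as.character(df_char$Species)

# Hypothetical temporary files for the round trip.
pq_factor <- tempfile(fileext = ".parquet")
pq_char <- tempfile(fileext = ".parquet")
fe_char <- tempfile(fileext = ".feather")

write_parquet(df_factor, pq_factor)
write_parquet(df_char, pq_char)
write_feather(df_char, fe_char)

# Collect the class of Species after reading each file back.
types <- c(
  parquet_factor = class(read_parquet(pq_factor)$Species)[1],
  parquet_char   = class(read_parquet(pq_char)$Species)[1],
  feather_char   = class(read_feather(fe_char)$Species)[1]
)
print(types)
```

If a character column is wanted from a file that stored a factor, converting after the read with as.character() is one straightforward option.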


Keith Hughitt / @khughitt:
@nealrichardson Indeed, read_feather and read_parquet both ignore stringsAsFactors when loading character columns, always preserving the proper character type.

I agree that this is the expected and desired behavior. I can close both this and the related "read_feather()" issue I reported.

Do you think it's worth including a note in the docs for these functions to caution users who aren't familiar with how parquet/feather handle column types?

It's true that most users should already have some experience with this from readRDS. However, I still suspect that other users will see the similarity of read_feather, read_parquet, read_tsv, etc., and not appreciate the differences between them.

Your call though. Either way, I appreciate you taking the time to respond and clarify the important differences between these functions.


Neal Richardson / @nealrichardson:
Sure, if you want to add a note to the docs for these functions saying that stringsAsFactors is ignored/irrelevant (as if FALSE, which of course it should always be anyway ;) because you're saving a rich binary file format, I'll approve such a PR. If it was surprising to you, it may be surprising to others.


Neal Richardson / @nealrichardson:
Now that wisdom has prevailed and stringsAsFactors = FALSE is the default in R 4.0, I don't think we need to add anything to the arrow docs. Feel free to reopen and submit a PR if you feel strongly otherwise.
