Problems with IDate data type after reading a CSV with import #293

Closed
jllipatz opened this issue Dec 9, 2021 · 5 comments

Comments

@jllipatz

jllipatz commented Dec 9, 2021

Hello,

I am experiencing non-deterministic problems with a data frame containing an IDate column. In the following example I kept only two columns of the original file (donnees-hospitalieres-covid19-2021-12-08-19h05.csv, or the same file for a different date, from https://www.data.gouv.fr/fr/datasets/donnees-hospitalieres-relatives-a-lepidemie-de-covid-19/). The file was read using import without any additional options.
Replacing the dplyr call with df2 <- df[df$sexe==0,] makes things work again, so I don't really know where the problem lies: in rio, in dplyr, or in the IDate type from data.table.

If the problem is related to the IDate type, wouldn't it be possible to use the standard Date type instead, overriding what fread did?
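For anyone who wants the Date-based workaround asked about here, a minimal sketch might look like the following. `idate_to_date()` is a hypothetical helper, not part of rio or data.table; it assumes a plain data.frame (what import returns by default) and uses only base R.

```r
# Hypothetical helper (not part of rio or data.table): convert every IDate
# column of a plain data.frame to a base Date column.
idate_to_date <- function(df) {
  # Flag the columns that carry the IDate class
  is_idate <- vapply(df, inherits, logical(1), what = "IDate")
  for (nm in names(df)[is_idate]) {
    # Drop the IDate class and store the day counts as doubles, which is
    # the usual storage for base Date
    df[[nm]] <- structure(as.numeric(unclass(df[[nm]])), class = "Date")
  }
  df
}

# Usage, assuming `df` came from import():
# df <- idate_to_date(df)
```

Since the result is an ordinary Date column, later arithmetic such as `df$jour - 365` no longer goes through the IDate methods.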

From a fresh session under R 4.1.1:

> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> df <- readRDS("U:/PbIDate.RDS")
> str(df)
'data.frame':   191540 obs. of  2 variables:
 $ sexe: int  0 1 2 0 1 2 0 1 2 0 ...
 $ jour: IDate, format: "2020-03-18" "2020-03-18" ...
> df2 <- filter(df,sexe==0)
> df$b <- df$jour-365
> df2$b <- df2$jour-365

> library(rio)
The following rio suggested packages are not installed: ‘arrow’, ‘hexView’, ‘jsonlite’, ‘pzfx’, ‘readODS’, ‘rmarkdown’, ‘rmatio’
Use 'install_formats()' to install them
> df <- readRDS("U:/PbIDate.RDS")
> str(df)
'data.frame':   191540 obs. of  2 variables:
 $ sexe: int  0 1 2 0 1 2 0 1 2 0 ...
 $ jour: IDate, format: "2020-03-18" "2020-03-18" ...
> df2 <- filter(df,sexe==0)
> df$b <- df$jour-365
> df2$b <- df2$jour-365
Error in `-.IDate`(df2$jour, 365) : 
  Internal error: storage mode of IDate is somehow no longer integer
@jsonbecker
Collaborator

jsonbecker commented May 4, 2022

I can't reproduce this issue:

library(rio)
download.file('https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7',
              destfile = 'covid.csv')
df <- import('covid.csv')
df$jour
sapply(df,class)
export(df, format = 'RDS')
import('df.rds')
df <- import('df.rds')
df$jour

All works fine for me.

R version 4.1.0 (2021-05-18)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 12.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] rio_0.5.29

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8.3      fansi_1.0.3       utf8_1.2.2        crayon_1.5.1
 [5] cellranger_1.1.0  lifecycle_1.0.1   magrittr_2.0.2    zip_2.2.0
 [9] pillar_1.7.0      stringi_1.7.6     rlang_1.0.2       cli_3.2.0
[13] readxl_1.3.1      curl_4.3.2        data.table_1.14.2 vctrs_0.3.8
[17] ellipsis_0.3.2    openxlsx_4.2.5    tools_4.1.0       forcats_0.5.1
[21] foreign_0.8-81    glue_1.6.2        hms_1.1.1         compiler_4.1.0
[25] pkgconfig_2.0.3   haven_2.4.1       tibble_3.1.6

@schochastics
Member

schochastics commented Sep 11, 2023

The problem is reproducible, but I think it is a dplyr problem. Something similar was discussed for data.table here: Rdatatable/data.table#2008

library(rio)
download.file('https://www.data.gouv.fr/fr/datasets/r/63352e38-d353-4b54-bfd1-f1b3ee1cabd7',
              destfile = 'covid.csv')
df <- import('covid.csv')
export(df, "df.RDS")
df <- import('df.RDS')
df$b <- df$jour-365
df2 <- dplyr::filter(df,sexe==0)
df2$b <- df2$jour-365
#> Error in `-.IDate`(df2$jour, 365): Internal error: storage mode of IDate is somehow no longer integer
storage.mode(df$jour)
#> [1] "integer"
storage.mode(df2$jour)
#> [1] "double"

Created on 2023-09-11 with reprex v2.0.2
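The storage.mode() check above can be applied to a whole data frame. The following is only a sketch: `find_broken_idate()` is a hypothetical helper name, and it assumes that "broken" simply means a column that still has the IDate class but is no longer stored as integer.

```r
# Hypothetical helper: return the names of columns that still carry the
# IDate class but are no longer stored as integer.
find_broken_idate <- function(df) {
  broken <- vapply(
    df,
    function(col) inherits(col, "IDate") && typeof(col) != "integer",
    logical(1)
  )
  names(df)[broken]
}

# Usage, assuming `df2` is the result of dplyr::filter():
# find_broken_idate(df2)
```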

Here is a minimal example that is independent of rio:

library(data.table)
library(dplyr)

df <- data.table(a=Sys.Date(),b=14)
df$a-365
#> [1] "2022-09-11"
storage.mode(df$a)
#> [1] "double"

tb <- as_tibble(df)
tb$a-365
#> [1] "2022-09-11"
storage.mode(tb$a)
#> [1] "double"

df$a <- as.IDate(df$a)
df$a-365
#> [1] "2022-09-11"
storage.mode(df$a)
#> [1] "integer"


tb <- as_tibble(df) |> dplyr::filter(a>=Sys.Date())
tb$a-365
#> Error in `-.IDate`(tb$a, 365): Internal error: storage mode of IDate is somehow no longer integer
storage.mode(tb$a)
#> [1] "double"

Created on 2023-09-11 with reprex v2.0.2
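Since the column keeps its IDate class and only the storage mode changes, one possible stopgap is to set the storage mode back to integer after filtering. This is a sketch, not a fix endorsed by data.table or dplyr; `repair_idate()` is a hypothetical helper. It relies on the base R behaviour that `storage.mode<-` keeps attributes such as the class, whereas `as.integer()` would drop them.

```r
# Hypothetical helper: restore integer storage on an IDate vector whose
# storage mode has become double. Other inputs are returned unchanged.
repair_idate <- function(x) {
  if (inherits(x, "IDate") && typeof(x) != "integer") {
    # `storage.mode<-` changes the underlying type and preserves the class
    storage.mode(x) <- "integer"
  }
  x
}

# Usage, assuming `tb` is the filtered tibble from the example above:
# tb$a <- repair_idate(tb$a)
# tb$a - 365
```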

@chainsawriot I don't think we need to do anything in rio, but is this something to escalate to the dplyr team?

Edit: see tidyverse/dplyr#6687 and tidyverse/dplyr#6230

@schochastics
Member

OK, this appears to be an open issue in vctrs: r-lib/vctrs#1781
I will close this issue here.

@chainsawriot
Collaborator

@schochastics Thank you very much for the investigation.

I also tried dtplyr, with the same result.

@jsonbecker
Collaborator

Oh fascinating!
