Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

R session crashing when querying local database #2

Closed
Pakillo opened this issue Jan 13, 2022 · 4 comments · Fixed by #3
Closed

R session crashing when querying local database #2

Pakillo opened this issue Jan 13, 2022 · 4 comments · Fixed by #3

Comments

@Pakillo
Copy link
Member

Pakillo commented Jan 13, 2022

Hi @cboettig,

I've tried your examples for querying a local database and they work fine. But I'm unable to do a basic query of species occurrences (<150 records). What am I doing wrong?

My code:

library(gbifdb)
library(dplyr)  

con <- gbif_conn(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/")
gbif <- tbl(con, "gbif")
gbif

pinsapo <- gbif %>% 
  filter(species == "Abies pinsapo",
         countrycode == "ES",
         year > 2010)
pinsapo

When I call 'pinsapo' my R session crashes. I am attaching images of two attempts:

Screenshot from 2022-01-13 15-36-54

Screenshot from 2022-01-13 15-38-15

My session info:

Session info ────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.2 (2021-11-01)
 os       Ubuntu 20.04.3 LTS
 system   x86_64, linux-gnu
 ui       RStudio
 language en_GB:en
 collate  en_GB.UTF-8
 ctype    en_GB.UTF-8
 tz       Europe/Madrid
 date     2022-01-13
 rstudio  1.4.1717 Juliet Rose (desktop)
 pandoc   2.11.4 @ /usr/lib/rstudio/bin/pandoc/ (via rmarkdown)

─ Packages ────────────────────────────────────────────────────────────────────────────────────────────────
 package       * version date (UTC) lib source
 arrow           6.0.1   2021-11-20 [1] RSPM (R 4.1.0)
 assertthat      0.2.1   2019-03-21 [1] RSPM (R 4.1.0)
 aws.s3          0.3.21  2020-04-07 [1] RSPM (R 4.1.0)
 aws.signature   0.6.0   2020-06-01 [1] RSPM (R 4.1.0)
 base64enc       0.1-3   2015-07-28 [1] RSPM (R 4.1.0)
 bit             4.0.4   2020-08-04 [1] RSPM (R 4.1.0)
 bit64           4.0.5   2020-08-30 [1] RSPM (R 4.1.0)
 blob            1.2.2   2021-07-23 [1] RSPM (R 4.1.0)
 cli             3.1.0   2021-10-27 [1] RSPM (R 4.1.0)
 codemeta        0.1.1   2021-12-22 [1] RSPM (R 4.1.0)
 crayon          1.4.2   2021-10-29 [1] CRAN (R 4.1.1)
 curl            4.3.2   2021-06-23 [1] RSPM (R 4.1.0)
 DBI             1.1.2   2021-12-20 [1] RSPM (R 4.1.0)
 dbplyr          2.1.1   2021-04-06 [1] RSPM (R 4.1.0)
 digest          0.6.29  2021-12-01 [1] RSPM (R 4.1.0)
 dplyr         * 1.0.7   2021-06-18 [1] RSPM (R 4.1.0)
 duckdb          0.3.2   2022-01-12 [1] local
 ellipsis        0.3.2   2021-04-29 [1] RSPM (R 4.1.0)
 evaluate        0.14    2019-05-28 [1] RSPM (R 4.1.0)
 fansi           0.5.0   2021-05-25 [1] RSPM (R 4.1.0)
 fastmap         1.1.0   2021-01-25 [1] RSPM (R 4.1.0)
 gbifdb        * 0.1.0   2022-01-13 [1] Github (cboettig/gbifdb@3cf5429)
 generics        0.1.1   2021-10-25 [1] RSPM (R 4.1.0)
 glue            1.6.0   2021-12-17 [1] RSPM (R 4.1.0)
 htmltools       0.5.2   2021-08-25 [1] RSPM (R 4.1.0)
 httr            1.4.2   2020-07-20 [1] RSPM (R 4.1.0)
 jsonlite        1.7.2   2020-12-09 [1] RSPM (R 4.1.0)
 knitr           1.37    2021-12-16 [1] RSPM (R 4.1.0)
 lifecycle       1.0.1   2021-09-24 [1] RSPM (R 4.1.0)
 magrittr        2.0.1   2020-11-17 [1] RSPM (R 4.1.0)
 pillar          1.6.4   2021-10-18 [1] RSPM (R 4.1.0)
 pkgconfig       2.0.3   2019-09-22 [1] RSPM (R 4.1.0)
 purrr           0.3.4   2020-04-17 [1] RSPM (R 4.1.0)
 R6              2.5.1   2021-08-19 [1] RSPM (R 4.1.0)
 rlang           0.4.12  2021-10-18 [1] RSPM (R 4.1.0)
 rmarkdown       2.11    2021-09-14 [1] RSPM (R 4.1.0)
 rstudioapi      0.13    2020-11-12 [1] RSPM (R 4.1.0)
 sessioninfo     1.2.2   2021-12-06 [1] RSPM (R 4.1.0)
 tibble          3.1.6   2021-11-07 [1] RSPM (R 4.1.0)
 tidyselect      1.1.1   2021-04-30 [1] RSPM (R 4.1.0)
 utf8            1.2.2   2021-07-24 [1] RSPM (R 4.1.0)
 vctrs           0.3.8   2021-04-29 [1] RSPM (R 4.1.0)
 xfun            0.29    2021-12-14 [1] RSPM (R 4.1.0)
 xml2            1.3.3   2021-11-30 [1] RSPM (R 4.1.0)
 yaml            2.2.1   2020-02-01 [1] RSPM (R 4.1.0)

 [1] /home/frs/R/x86_64-pc-linux-gnu-library/4.0
 [2] /opt/R/4.1.2/lib/R/library

This is the last bit I need to solve to finish my review!

Thanks

@cboettig
Copy link
Member

cboettig commented Jan 13, 2022

@Pakillo Thanks so much for the detailed report! I can reproduce this crash. Looks like this might be a duckdb issue, possibly stemming from the nature of the parquet data with how filters work when all columns are returned. I need to play around a bit more and probably follow up with the duckdb devs.

For comparison, can you try the same query in pure arrow, e.g.

gbif <- arrow::open_dataset("/home/shared-data/gbif/occurrence/2021-11-01/occurrence.parquet/")
pinsapo <- gbif %>% 
  filter(species == "Abies pinsapo",
         countrycode == "ES",
         year > 2010)
pinsapo %>% collect()

That query works for me (though is still somewhat intensive).

Also would be great if you could test with adding a select() call to get only a subset of columns back (both in arrow and the native duckdb call). It looks like it makes a huge difference to select() only the columns you need when querying parquet (for both local duckdb and remote arrow connections).

(Obviously these are things we'd want to at least document! and it's possible we can avoid them, at least on the local side, by reading the parquet files into the native duckdb backend first. I'm totally still learning here so thanks for exploring this with me!)

@Pakillo
Copy link
Member Author

Pakillo commented Jan 13, 2022

Thanks for the quick reply @cboettig!

You're totally right about select. Once I select just a few columns to return, it works! And quite fast.

con <- gbif_conn(dir = "/media/frs/Elements/gbifdb/occurrence.parquet/")
gbif <- tbl(con, "gbif")

pinsapo <- gbif %>% 
  filter(species == "Abies pinsapo",
         countrycode == "ES",
         year > 2010) %>% 
  select(species, year, decimallatitude, decimallongitude)

pinsapo

BUT from several tests I've done selecting different columns, it looks to me the problem may be with the last two columns in the database: mediatype and issue. I see here that these two columns are special (string arrays), compared to all the other columns, that are strings, double, etc.

So, I can select up to 48 columns from the 50 available without any problem, as long as I don't select any of those two columns.

Does this make sense? What can be done? (apart from documenting this in the package, of course). If you avoid those two columns it seems all my queries work fine

@cboettig
Copy link
Member

nice tracking that down! yeah that makes sense! Yeah, I think we can potentially work around this issue then by excluding those 2 columns by default (e.g. internally inside gbif_remote() and gbif_local() ? (Thanks to @stephhazlitt I've added a gbif_local() as a more intuitive parallel to gbif_remote(); lives on the patch-1 branch for now. Also gbif_local() is now using duckdb-via-arrow as the default backend instead of pure duckb since it seems a bit more robust to this issue)

@Pakillo
Copy link
Member Author

Pakillo commented Jan 13, 2022

Excluding those two columns looks like a good solution to me, at least by now 👍

I had submitted the review before seeing your response, just to move things forward, but I'm happy to keep an eye if I can be of help

Cheers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants