feat: Add col.types argument to duckdb_read_csv() #445
krlmlr merged 17 commits into duckdb:main from eli-daniels:resolve-conflict
Thanks for the effort, good idea! This function is getting too complex now. Is there a way to split it into more focused functions, one of which would be responsible for inferring the column types? Can you propose a refactoring that will later make it easy to add this functionality? The branch now also conflicts with the main branch, sorry about that.
R/csv.R
#' @param col.names Override the detected or generated column names
#' @param col.types Character vector of column types in the same order as col.names,
#' or a named character vector where names are column names and types pairs.
#' Valid types are DuckDB data types, e.g. VARCHAR, DOUBLE, DATE, BIGINT, BOOLEAN, etc.
Instead of saying "etc.", can we point to documentation somewhere?
tests/testthat/test-read.R
col.types = c(
  Sepal.Length = "DOUBLE",
  Sepal.Width = "DOUBLE",
  Petal.Length = "DOUBLE",
  Petal.Width = "DOUBLE",
  Species = "DOUBLE"
)
There are no tests for date types; that feels risky.
Yep, I'll give it a shot and add some tests and better docs. Is there any way to avoid recompiling DuckDB when running R CMD check?
I use |
I've refactored it a bit and added a DATE type test. Ready for review. If there is anything else I can do to make it better, let me know. R CMD check gives: [ FAIL 0 | WARN 0 | SKIP 52 | PASS 4810 ]
krlmlr left a comment
Thanks, looks like CI/CD is failing.
Thanks for the review @krlmlr
Thanks! This is a good start, but ideally, I'd like to proceed as outlined in #118 (comment). Merging for now, let's see.
Agreed, but this adds the functionality requested across a few issues. Hopefully it helps someone. I'm still thinking about how to approach the method discussed in #118.
This addresses 2 issues: #118 and #383
col.types can be given as a named character vector providing the names and types of the columns such as:
col.types = c(col0 = 'VARCHAR', col1 = 'DOUBLE', ...)
or as an unnamed character vector, in which case the column names are taken from the read.csv output or the col.names argument. Column names given by col.types take precedence over col.names.
The data types are given as DuckDB data types, e.g. VARCHAR, DOUBLE, BIGINT.
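The precedence rules above can be sketched in base R. This is only an illustration: `resolve_col_types()` and its `detected_names` argument are hypothetical names invented for this example, not the code in R/csv.R, and the length check is an assumption about how mismatches might be handled.

```r
# Hypothetical helper, NOT part of the duckdb package: it only illustrates
# how names from col.types, col.names, and the file could be reconciled.
resolve_col_types <- function(detected_names, col.names = NULL, col.types = NULL) {
  # Start from the names detected in the file; col.names overrides them.
  names_out <- if (is.null(col.names)) detected_names else col.names

  # Without col.types there is nothing more to resolve.
  if (is.null(col.types)) {
    return(list(names = names_out, types = NULL))
  }
  stopifnot(is.character(col.types))

  # Names carried by col.types take precedence over col.names.
  if (!is.null(names(col.types))) {
    names_out <- names(col.types)
  }

  # Assumption: one type per column is required.
  stopifnot(length(col.types) == length(names_out))

  list(names = names_out, types = unname(col.types))
}

# Named vector: names come from col.types, even though col.names is given.
named <- resolve_col_types(
  detected_names = c("V1", "V2"),
  col.names = c("a", "b"),
  col.types = c(col0 = "VARCHAR", col1 = "DOUBLE")
)

# Unnamed vector: names fall back to col.names.
unnamed <- resolve_col_types(
  detected_names = c("V1", "V2"),
  col.names = c("a", "b"),
  col.types = c("VARCHAR", "DOUBLE")
)
```

In the named case the resulting column names are `col0` and `col1`; in the unnamed case they are `a` and `b`, with the types applied in order.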
As part of this I changed dbWriteTable to dbCreateTable, so I could easily add the temporary parameter mentioned in #142, but I don't want to get ahead of myself.
I also made a minor addition to the docs to mention that the CSV files are appended to the table if the table already exists. I hope that is okay.
Happy to modify anything that may be amiss!
Closes #383.