regexp_matches doesn't find all matches #17812
Replies: 1 comment
-
|
Why do you need a The regular expression looks OK. I find Regexper useful. I'm not sure how performant the R extension is, but my instinct would be to do as much as you can in the SQL layer to keep it in DuckDB for speed. Do you have a DuckDB database on disk with your tables, or are you doing this all in memory on R data tables? Are you saying that you have regular expressions 108 characters in length? Long regular expressions could be computationally expensive. Are you sure you need them? If It's a bit difficult to give advice without knowing what your actual data is. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Hello
I have two data.tables, dt1 containing the variables (id,type) and dt2 with the variables (id,codes). I then want a new datatable containing (id,type,codes) where the id's are from dt1 and codes have to be some specific codes. To get this I am trying to do a left join and then only keep the rows where regexp_matches returns TRUE. If I do this in only one query, duckdb returns too few rows, where as if I do the left join in duckdb and then use grepl() in R to do the filtering, I get the expected result. Have I coded it wrong, or is this a common issue with regexp_matches? I suspect it has something to do with the length of codes I want to include, as the 'wrong code' results vary when I do the duckdb query in batches. The regular expression list of codes I want to include in the final datatable have a length of 108, with multiple emtries in the form "HA[0-8]". Also the data.tables are quite large: dt1 has 0.7 million rows and dt2 has over 300 million rows.
Example code:
Wrong code:
dt3 <- data.table::setDT(dbGetQuery(con, "
SELECT dt1.*, dt2.codes
FROM dt1
LEFT JOIN dt2 ON dt1.id = dt2.id
WHERE regexp_matches(dt2.codes, '^(a|b|c[3-9])' )
"))
Code with correct result
dt3 <- data.table::setDT(dbGetQuery(con, "
SELECT dt1.*, dt2.codes
FROM dt1
LEFT JOIN dt2 ON dt1.id = df2.id
"))
my_codes <- c("^a", "^b", "^c[3-9]")
regex <- paste(my_codes, collapse = "|")
dt3_filtered <- dt3[grepl(regex, codes)]
Appreciate any help.
Beta Was this translation helpful? Give feedback.
All reactions