invalid unicode in segments statistics #1650
Unrelated question: why are you using the JDBC driver with RStudio? We have a dedicated R package.
I'm using a database client over JDBC, and the R package for code. I did not explain myself properly.
No, the issue should certainly not be size-related. But it would help us a lot if you could try to create a minimal reproducible example that demonstrates the issue.
Sadly, the data is protected. I assume any data file with more than 600 columns and 500,000 rows will give the same result. I will try to generate such a file.
Great, thanks. Otherwise there is little we can do.
Same issue here. What are the possible issues in this case?
What API are you using, and how are you loading data into the database? Could you make a reproducible example of the problem? This error is always a bug, as the system should convert all data to UTF-8 when ingesting it. Perhaps one of the ingestion methods does not convert or verify correctly.
The R API. I just tried with a small subset (100 lines) and it worked. The whole thing is 4.2 GB, so I'm not quite sure how to debug it. Will try some more sampling :|
Alright, so I just subset my data until I pinpointed an entry containing this "SN\xaa" character. Is that a known bug?
That is not a known bug. Could you please file a bug report? Thanks!
@Mytherin, I'm having the same issue with the file below, "test.csv":
which throws this error message: I believe the problem is with the UTF-8 encoding, because when I import the .csv without the encoding option, it works (however, the special characters from my language are lost).
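For what it's worth, a minimal sketch of a workaround, assuming the file is Latin-1 encoded (the file and table names here are placeholders, not taken from the report): read the CSV with its real encoding, convert the character columns to UTF-8 in R, and only then hand the data to DuckDB.

```r
library(DBI)
library(duckdb)

# Read the CSV with its actual source encoding (assumed Latin-1 here).
df <- read.csv("test.csv", fileEncoding = "latin1", stringsAsFactors = FALSE)

# Convert every character column to UTF-8 before writing to DuckDB.
chr_cols <- vapply(df, is.character, logical(1))
df[chr_cols] <- lapply(df[chr_cols], enc2utf8)

con <- dbConnect(duckdb())
dbWriteTable(con, "test", df)
dbDisconnect(con, shutdown = TRUE)
```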
I also encountered this while using
@GitHunter0 Thanks for providing a reproducible example; I can confirm the issue, at least in the latest CRAN release.
You're welcome, @hannesmuehleisen. Would you consider reopening this issue?
I've had this issue now, too. I used the R package and dbplyr. It was a join resulting in a biggish table (4 billion records, 2 columns). I was able to use tally() to find that out (as I wanted to know whether pulling it into R was feasible), but not compute() to create the result. That's of course exactly the range of record counts where duckdb would be handy for an R user. disk.frame did the job.
I'm having the same issue using
Each CSV file contains a table with ~1M rows; the example is 1,161,542 rows and 73 columns, with doctor information, drugs prescribed, and other details (each year is in a different file; I'm using only 2020 for illustration, but it occurred with earlier years too).
Note: I'm using the
Session Info:
I can get around this error by parsing data into memory via
I've been able to read the first 7 files of 24M each using this method, but still get an error with the latest 2020 file... I've used this code, which works except for the last file (2020), which throws the unicode error (the temp df is just file names and years):
Tried
The 2020 data file (MUP_DPR_RY22_P04_V10_DY20_NPIBN_0.csv) reads into memory without issue; the same error occurs when trying to append to duckdb somewhere between rows 20,500,000 and 20,600,000 (it saves OK for rows 20.6M:end).
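For context, the per-file append loop being described probably looks something like this sketch; the directory, file pattern, and table name are invented, since the original code wasn't included in the extract.

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb(), dbdir = "prescribers.duckdb")

# Hypothetical listing of the yearly MUP_DPR_* CSV exports.
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

for (f in files) {
  df <- read.csv(f, stringsAsFactors = FALSE)
  if (!dbExistsTable(con, "prescriptions")) {
    dbWriteTable(con, "prescriptions", df)                 # first file creates the table
  } else {
    dbWriteTable(con, "prescriptions", df, append = TRUE)  # later files append
  }
}

dbDisconnect(con, shutdown = TRUE)
```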
I found the culprit in my data. The issue is clearly with non-recognized characters in the data file. I identified the location of "non-ASCII" characters in the rows that threw the error (20,500,000–20,600,000) using this code, and then fixed it. Once that single value was fixed, I was able to save to the duckdb database.
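The snippet itself wasn't preserved in this extract, but a sketch along these lines will locate such values (the function names are mine, not the commenter's):

```r
# Index the rows of each character column that contain non-ASCII bytes.
find_non_ascii <- function(df) {
  chr_cols <- names(df)[vapply(df, is.character, logical(1))]
  hits <- lapply(chr_cols, function(col) {
    which(grepl("[^\x01-\x7F]", df[[col]], useBytes = TRUE))
  })
  names(hits) <- chr_cols
  Filter(length, hits)  # keep only columns that actually had hits
}

# Narrower check: strings that are not valid UTF-8 at all, i.e. the
# ones DuckDB's segment statistics reject.
which_invalid_utf8 <- function(x) which(!validUTF8(x))
```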
Interestingly, there were a number of non-ASCII characters in sections that did not cause an error and did not require "fixing" for the database to save. Here are those, just in case it helps in some way:
It would seem we need to double-check the UTF-8 verification for R data frame writing.
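A sketch of a deliberate reproduction from the R side, assuming (this is my guess, not confirmed by the thread) that the failure mode is a raw Latin-1 byte, like the \xaa seen above, being mislabelled as UTF-8 and slipping through unverified:

```r
library(DBI)
library(duckdb)

# Build a string whose bytes are not valid UTF-8 (0xAA is a bare
# continuation byte), but declare it as UTF-8 anyway.
bad <- "SN\xaa"
Encoding(bad) <- "UTF-8"

df <- data.frame(v = bad, stringsAsFactors = FALSE)

con <- dbConnect(duckdb())
# If ingestion verification is incomplete, this write succeeds and the
# "invalid unicode in segment statistics" error surfaces later instead.
try(dbWriteTable(con, "repro", df))
dbDisconnect(con, shutdown = TRUE)
```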
Hi, I converted my JSON data to a Parquet file (in Java) and used a "CREATE TABLE ... AS SELECT * FROM read_parquet()" statement to import the data into DuckDB; I hit the same exception. My data has 3 fields; one of them is a numeric type and has null values.
Could you share the Parquet file with us (e.g. send to my email at
Sorry, the file data is confidential. Writing the Parquet file using Spark fixed this exception. Thanks.
We've had reports of this problem from quite a few users of Splink (a Python library which uses DuckDB as a backend). Although sadly I haven't got a minimal reproducible example (because, being on Mac, I've never seen the error myself), a common theme seems to be that these errors occur on Windows, with data sourced from Microsoft SQL Server. I will do my best to try and obtain a reprex.
How is your data ingested into DuckDB? Are you inserting through Pandas DataFrames, CSV files, or through prepared statements?
Via Pandas DataFrames.
It is likely the unicode ingestion verification there is missing or incomplete. We can have a look to see if we can create a reproducible example by intentionally creating Pandas data frames with bogus unicode data.
I can confirm that the error I was having is gone; see BigBangData/BookReviews@f212584
Same here. Thanks!
Or I should say... now I get a different error. =) The same query I posted above now causes an OOM exception. Full report here: #5315
Seeing a similar issue on duckdb 0.6.0:
Note:
Thanks for the report! Any chance you could share the data with us (could be done privately by e-mail: mark@duckdblabs.com)?
Unfortunately I cannot share this data. But perhaps someone has created a data masking/randomization tool for duckdb tables or Parquet data? I think in many cases (including this one) it is the data's size and shape, not the actual particulars of the values, that is causing the issue.
I created a gist here previously that uses the
Apologies for my delay. DM sent with scrambled data. For this case I had to implement my own script to scramble Parquet data, which others might find useful: https://gist.github.com/AlexanderVR/d2ed810799be4649446ef0d51364a404 There is some subtlety here: only changing the row group structure or the nullability masks of the data will make the issue disappear.
For those who read this issue: I too ran into it, and when working with R I could clean up a list of data frames like so:

```r
library(purrr)
library(dplyr)

my_dataframes |>
  map(function(x) x |> mutate(across(
    where(is.character),
    function(x) stringi::stri_encode(x, to = "UTF-8")
  )))
```

After this the error disappeared.
Any workaround? I'm using the appender to insert Cyrillic strings, and it throws the exception "Invalid unicode (byte sequence mismatch) detected in segment statistics update".
The data has to be valid UTF-8 when inserted into the system. If you are inserting data that is not valid UTF-8, then you should convert it prior to inserting.
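In base R terms, that pre-insert check-and-convert step could look like this sketch; CP1251 is assumed as the source encoding purely as a stand-in for Cyrillic data, and the sample vector is invented:

```r
# Hypothetical vector in which one entry carries raw CP1251 bytes
# ("\xef\xf0\xe8\xe2\xe5\xf2" is "привет" in CP1251).
strings <- c("ok ascii", "\xef\xf0\xe8\xe2\xe5\xf2")

# Find the entries DuckDB would reject as invalid UTF-8 ...
invalid <- !validUTF8(strings)

# ... and re-encode only those before inserting.
strings[invalid] <- iconv(strings[invalid], from = "CP1251", to = "UTF-8")
```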
I'm sorry; I found an ugly workaround for the .NET DuckDB driver (the issue was there): call "duckdb_append_varchar" with a byte array argument instead of the .NET string, and the exception goes away. Sorry for disturbing :)
I was just playing with the command line on a Windows box and got said exception with a file that is Win-1252 encoded. I understand the issue; I exported a CSV file from online banking.
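Sticking with R for consistency, a sketch of re-encoding such an export before loading it; the file names are invented, and any tool that converts Windows-1252 to UTF-8 would do just as well.

```r
# Read the bank export, letting R translate from Windows-1252 ...
con_in <- file("export.csv", encoding = "windows-1252")
lines  <- readLines(con_in)
close(con_in)

# ... then write it back out as plain UTF-8 for DuckDB to ingest.
writeLines(enc2utf8(lines), "export_utf8.csv", useBytes = TRUE)
```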
I am also seeing this error when running queries. The behavior is pretty strange:
As for the query, I'm doing something like
I thought it might be something with the arrays, because other operations with the column seem fine. But I'm not sure. Any idea what I can investigate to make things work? Thanks.
Could you share a Parquet file containing the problematic column with us? If you can't publish it publicly, feel free to send me an email with the data.
Sadly I can't; it is confidential. What kind of digging can I do on my end that would help? I'll at least try to chase a few versions of the data (we should have a Spark-written copy, a Trino-written copy, and a Delta Lake copy that might use a different writer) to find out whether this is a Parquet issue or a data issue. The CSV export fixing it is pretty suspicious, I feel. Also, I just realized this was on 0.7 and on 0.8.0, but today I updated to 0.8.1; I'll check if I can reproduce it on 0.8.1.
I have retested it on 0.8.1. I don't know how I got it working last time; this time it never works, no matter whether I go through CSV or Parquet or anything else. @lnkuiper I'll go try it.
@lnkuiper It fixed it!
@wrb2 Thank you for taking the time to test! Glad this is fixed.
Any idea when this might end up in a new release?
Hi, I'm still hitting this error on the
console output:
Thanks for reporting! Could you open a new issue in the new duckdb-r repo?
Working in R with the DuckDB package, with no issues so far. Trying to load a big dataframe into the database, I got an error I cannot understand:
Error: TransactionContext Error: Failed to commit: INTERNAL Error: Invalid unicode detected in segment statistics update!
My guess is the dataframe is too wide (635 columns) or too long (500,000 rows).
Windows 10, duckdb JDBC connection, 0.24 driver, RStudio.