invalid unicode in segments statistics #1650

Closed

verajosemanuel opened this issue Apr 19, 2021 · 62 comments

@verajosemanuel
verajosemanuel commented Apr 19, 2021

I've been working in R with the DuckDB package with no issues so far, but when trying to load a big dataframe into the database I got an error I cannot understand:

Error: TransactionContext Error: Failed to commit: INTERNAL Error: Invalid unicode detected in segment statistics update!

My guess is that the dataframe is too wide (635 columns) or too long (500,000 rows).

Windows 10, DuckDB JDBC driver 0.24, RStudio.

@hannes
Member

hannes commented Apr 19, 2021

Unrelated question: why are you using the JDBC driver with RStudio? We have a dedicated R package...

@verajosemanuel
Author

verajosemanuel commented Apr 20, 2021

I'm using a database client over JDBC, and the R package for code. I did not explain myself properly.
Is this size related? I need to load all my data into one table. Maybe if I split it into different files...

@hannes
Member

hannes commented Apr 20, 2021

No, the issue should certainly not be size related. But it would help us a lot if you could try to create a minimal reproducible example that demonstrates the issue.

@verajosemanuel
Author

verajosemanuel commented Apr 21, 2021

Sadly, the data is protected. I assume any data file with more than 600 columns and 500,000 rows will give the same result. I will try to generate such a file.

@hannes
Member

hannes commented Apr 22, 2021

Great, thanks. Otherwise there is little we can do.

@hannes hannes closed this as completed Jul 19, 2021
@albersonmiranda

Same issue here. What are the possible causes in this case? (Invalid unicode detected in segment statistics update!) Is it related to encoding? I may have latin1 chars in my data.

@Mytherin
Collaborator

What API are you using and how are you loading data into the database? Could you make a reproducible example of the problem? This error is always a bug as the system should convert all data to UTF8 when ingesting the data. Perhaps one of the ingestion methods does not convert or verify correctly.

@albersonmiranda

R API. I just tried with a small subset (100 lines) and it worked. The whole thing is 4.2 GB, so I'm not quite sure how to debug it. Will try some more sampling :|

@albersonmiranda

Alright, so I kept subsetting my data until I pinpointed an entry containing this "SN\xaa" character. Is that a known bug?
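
A minimal sketch for locating such entries (not from the original report; my_df is a placeholder for the data frame being written): it lists, per character column, the row indices whose values are not valid UTF-8, before the data frame goes to DuckDB.

# locate entries that are not valid UTF-8, column by column
bad_rows <- lapply(
  Filter(is.character, my_df),            # keep only the character columns
  function(col) which(!validUTF8(col))    # row indices with invalid byte sequences
)
bad_rows[lengths(bad_rows) > 0]           # show only the offending columns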

@Mytherin
Collaborator

That is not a known bug. Could you please file a bug report? Thanks!

@GitHunter0

@Mytherin, I'm having the same issue with the file below, "test.csv":
test.csv

db_con <- DBI::dbConnect(duckdb::duckdb(), dbdir="./test.duckdb", read_only=FALSE)
data_from_csv <- data.table::fread("./test.csv", encoding = "UTF-8")
DBI::dbWriteTable(db_con, "test", data_from_csv, overwrite=TRUE) 

which throws this error message: Error in duckdb_execute(res) : duckdb_execute_R: Failed to run query Error: TransactionContext Error: Failed to commit: INTERNAL Error: Invalid unicode detected in segment statistics update!

I believe the problem is with the UTF-8 encoding, because when I import the .csv without the encoding option it works (however, the special characters from my language are lost).
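
A hedged workaround sketch (assuming the file is actually latin1-encoded even though it is declared UTF-8; not verified against test.csv): read without the encoding flag and re-encode the character columns to UTF-8 before writing, so the special characters are preserved.

data_from_csv <- data.table::fread("./test.csv")        # read with the file's native encoding
data_from_csv <- as.data.frame(data_from_csv)
chr_cols <- vapply(data_from_csv, is.character, logical(1))
data_from_csv[chr_cols] <- lapply(data_from_csv[chr_cols],
                                  iconv, from = "latin1", to = "UTF-8")
DBI::dbWriteTable(db_con, "test", data_from_csv, overwrite = TRUE)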

@victorlin

I also encountered this while using connection.from_df(df).create('table_name') and CREATE TABLE on a large dataset with various country names. Unfortunately, reproducibility is tricky and I haven't found a good shareable data file, but it seems to happen more often with a low memory limit set via PRAGMA memory_limit (if I set it to 1 GB it always happens).

@hannes
Member

hannes commented Jan 11, 2022

@GitHunter0 Thanks for providing a reproducible example, I can confirm the issue at least in the latest CRAN release.

@GitHunter0

You're welcome @hannesmuehleisen. Would you consider reopening this issue?

@hannes hannes reopened this Jan 11, 2022
@KnutJaegersberg

I've hit this issue now, too. I used the R package and dbplyr. It was a join resulting in a biggish table (4 billion records, 2 columns). I was able to use tally() to find that out (as I wanted to know whether pulling it into R was feasible), but not compute() to create the result. That's of course the range of record counts where duckdb would be handy for an R user. disk.frame did the job.

@JMLuther

JMLuther commented Sep 7, 2022

I'm having the same issue using duckdb 0.5.0 and R 4.2.1, running on Windows 10.
I'm using open data files from CMS, downloadable at the CMS Medicare data site (click the "download" option; only one file is needed to reproduce; the example uses 2020, the most recent file).

Each CSV file contains a table with ~1M rows; the example file has 1,161,542 rows and 73 columns, with doctor information, drug prescribed and other details (each year is in a different file; I'm using only 2020 for illustration, but it occurred with earlier years too).
I'm able to write the table only if I subset to <200,000 rows. It also fails at a similar limit on other files.
Clarification: the full data set reads into memory using readr::read_csv just fine, but I subset it to find where the error in the duckdb::dbWriteTable step occurs.
I cannot identify any "hidden" unicode characters in the referenced field.

# reproducible example
cms_provider_files <- list.files(path = here::here("data_by_provider/csv_files_by_year/"), pattern = "*.csv", full.names = T)

con = dbConnect(duckdb::duckdb(), 
                dbdir=here::here("data_by_provider/db_byprovider/providerdb.duckdb"))
# total rows = 1,161,542; 73 columns
df_2020 <- readr::read_csv(cms_provider_files[8], n_max = 100000) # dbWriteTable below: OK
df_2020 <- readr::read_csv(cms_provider_files[8], n_max = 150000) # dbWriteTable below: OK
df_2020 <- readr::read_csv(cms_provider_files[8], n_max = 200000) # dbWriteTable below: ERROR
df_2020 <- readr::read_csv(cms_provider_files[8], n_max = Inf)    # dbWriteTable below: ERROR
duckdb::dbWriteTable(con, "test_data", df_2020, overwrite = TRUE)
# Error: rapi_execute: Failed to run query
# Error: TransactionContext Error: Failed to commit: INTERNAL Error: Invalid unicode detected in segment statistics update!

Note: I'm using the readr package to parse the files because duckdb_read_csv() errors as well. Not sure if it's related, because it occurs on an earlier row (example below). I thought the readr approach would get around this, but it does not.

duckdb_read_csv(con, "test_df2020", cms_provider_files[8])
# Error: rapi_execute: Failed to run query
# Error: Invalid Input Error: Could not convert string 'K1H 8' to INT32 between line 56321 and 57345 in column 12. Parser options: DELIMITER=',', QUOTE='"', ESCAPE='"' (default), HEADER=1, SAMPLE_SIZE=10240, IGNORE_ERRORS=0, ALL_VARCHAR=0 

Session Info:

> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dbplyr_2.2.1    duckdb_0.5.0    DBI_1.1.3       forcats_0.5.2   stringr_1.4.1   dplyr_1.0.9     purrr_0.3.4     readr_2.1.2     tidyr_1.2.0    
[10] tibble_3.1.8    ggplot2_3.3.6   tidyverse_1.3.2

@hannes hannes self-assigned this Sep 8, 2022
@JMLuther

JMLuther commented Sep 10, 2022

I can get around this error by parsing the data into memory via readr::read_csv with an explicit column specification (using col_types = list(...)), and then passing each result to duckdb::dbWriteTable(...).

I've been able to read the first 7 files of 24M each using this method, but I still get an error with the latest 2020 file...
(This is an additional file set at CMS.gov..., which is my intended, more detailed/larger data set, rather than the first one I listed above.) I believe the data is too large to read all files into memory at once.

I've used this code, which works except for the last file (2020), which throws the unicode error (the temp df is just file names and years):

walk2(temp$cms_provider_files, temp$year,
        ~{
          readr::read_csv(.x, n_max = Inf, col_types = cms_providerdb_spec) %>% 
            mutate(year = .y) %>% 
            duckdb::dbWriteTable(conn = con_provider, name =  paste0("cms_providerdb_", .y), value = .,  overwrite = T)
        })

I tried duckdb_read_csv() to read them all, but it does not let me pass the column specification via the colClasses argument, and I also want to add a column indicating the year.

The 2020 data file (MUP_DPR_RY22_P04_V10_DY20_NPIBN_0.csv) reads into memory without issue; the same error occurs when trying to append to duckdb somewhere between rows 20,500,000 and 20,600,000 (rows 20.6M to the end save OK).

df_puf2020 <- readr::read_csv(cms_provider_files[8], n_max = Inf, col_types = cms_providerdb_spec) 
nrow(df_puf2020)
# [1] 25209729
duckdb::dbWriteTable(conn = con_provider, name =  "cms_providerdb_2020", value = df_puf2020[1:20500000, ],  overwrite = T) # OK
duckdb::dbWriteTable(conn = con_provider, name =  "cms_providerdb_2020", value = df_puf2020[20600000:25209729, ],  overwrite = F, append = T) # OK
duckdb::dbWriteTable(conn = con_provider, name =  "cms_providerdb_2020", value = df_puf2020[20500001:20599999, ],  overwrite = F, append = T) # Error
duckdb::dbWriteTable(conn = con_provider, name =  "cms_providerdb_2020", value = df_puf2020[20500001:20550001, ],  overwrite = F, append = T) # OK

@JMLuther

I found the culprit in my data. The issue is clearly with non-recognized characters in the data file. I identified the location of "non-ascii" characters in the rows that throw the error (20,500,000 - 20,600,000) using this code, and then fixed it. Once that single value was fixed, I was able to save to the duckdb database.

map(df_puf2020, ~which(!stringi::stri_enc_isascii(.x)))
# $Prscrbr_City
df_puf2020$Prscrbr_City[20565445] # "Mayag\xe3\x9cez"
df_puf2020$Prscrbr_City[20565445] <- "Mayaguez"

Interestingly, there were a number of non-ascii characters in sections that did not cause an error, and did not require "fixing" for the database to save. Here are those just in case it helps in some way:

# $Prscrbr_Last_Org_Name
bad_ascii_lastname <- c(22057728, 23970569, 23970570, 23970571, 23970572, 23970573)
df_puf2020$Prscrbr_Last_Org_Name[bad_ascii_lastname]
# [1] "Pe�a-Alvarado"   "Rivera-Monta�ez" "Rivera-Monta�ez" "Rivera-Monta�ez" "Rivera-Monta�ez" "Rivera-Monta�ez"

# $Prscrbr_First_Name
bad_ascii_firstname <- c(10254424, 10254425,  10254426,  10254427,  10254428,  10254429,  10254430,  10254431,  10254432,  10254433,  10254434,  
                         10254435,  10254436,  10254437,  10254438,  10254439,  10254440,  10254441,  10254442, 10254443,  10254444,  10254445,  
                         10254446,  10254447,  10254448,  10254449,  10254450,  10254451,  10254452,  10254453,  10254454,  10254455,  10254456, 
                         10254457,  10254458,  10254459,  10254460,  10254461)
df_puf2020$Prscrbr_First_Name[bad_ascii_firstname] <- "Concepcion" # "Concepci�n" 
# [1] "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n"
# [14] "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n"
# [27] "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n" "Concepci�n"

@hannes
Member

hannes commented Sep 12, 2022

It would seem we need to double-check the UTF-8 verification for R data frame writing.

@Stongtong

Hi, I converted my JSON data to a Parquet file (via Java), and used "create table ... as select * from read_parquet()" SQL to import the data into DuckDB; I hit the same exception. My data has 3 fields; one of them is a numeric type and has null values.

@Mytherin
Collaborator

Could you share the Parquet file with us (e.g. send to my email at mark@duckdblabs.com)?

@Stongtong

Sorry, the file data is secret. Writing the Parquet file using Spark fixed this exception for me. Thanks.

@RobinL

RobinL commented Oct 21, 2022

We've had reports of this problem from quite a few users of Splink (a Python library which uses DuckDB as a backend). Although sadly I haven't got a minimal reproducible example (because, being on Mac, I've never seen the error myself), a common theme seems to be that these errors occur on Windows, with data sourced from Microsoft SQL Server. I will do my best to try and obtain a reprex.

moj-analytical-services/splink#664

@Mytherin
Collaborator

How is your data ingested into DuckDB? Are you inserting through Pandas DataFrames, CSV files, or inserting through prepared statements?

@RobinL

RobinL commented Oct 22, 2022

Via Pandas dataframes

@Mytherin
Collaborator

It is likely the unicode ingestion verification there is missing or incomplete - we can have a look to see if we can create a reproducible example by intentionally creating pandas data frames with bogus unicode data.

@BigBangData

I can confirm that the error I was having is gone, see BigBangData/BookReviews@f212584

@ngould

ngould commented Nov 12, 2022

Same here. Thanks!

@ngould

ngould commented Nov 12, 2022

Or I should say...now I got a different error. =)

Same query I posted above now causes an OOM exception. Full report here: #5315

@AlexanderVR

Seeing a similar issue on duckdb 0.6.0:

❯ duckdb data/main.duckdb
v0.6.0 2213f9c946
Enter ".help" for usage hints.
D  create or replace table blah as (with us as (select distinct Address from s.user) select Address from us);
D  create or replace table blah as (with us as (select distinct * from s.user) select Address from us);
Error: INTERNAL Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update

Address column is of type:

struct(country varchar, countrycode varchar, city varchar, postalcode varchar, state varchar, statecode varchar, street varchar)

Note:

  • this table has 420,000 rows
  • as shown, it's the distinct in the CTE that triggers the error
  • this also happens only in the context of a create table statement
  • selecting any subfield of Address after the CTE, e.g. select Address.country, works fine; the entire struct has to be selected to cause the issue

@Mytherin
Collaborator

Mytherin commented Dec 4, 2022

Thanks for the report! Any chance you could share the data with us (could be done privately by e-mail - mark@duckdblabs.com)?

@AlexanderVR

Unfortunately I cannot share this data. But perhaps someone has created a data masking/randomization tool for duckdb tables or parquet data? I think for many cases (including this one) it is the data size and shape, not the actual particulars of the values that is causing the issue.

@Mytherin
Collaborator

Mytherin commented Dec 5, 2022

I created a gist here previously that uses the faker library to generate random data. Perhaps that would be an option?

@AlexanderVR

Apologies for my delay. DM sent with scrambled data. For this case I had to implement my own script to scramble parquet data, which others might find useful: https://gist.github.com/AlexanderVR/d2ed810799be4649446ef0d51364a404

There is some subtlety here, as only changing the row group structure or the nullability masks of the data will cause the issue to disappear.

@mskyttner

For those who read this issue: I too ran into it, and when working with R I could clean up a list of dataframes like so:

  library(dplyr)
  library(purrr)

  my_dataframes |> 
    map(function(x) x |> mutate(across(
      where(is.character), 
      function(x) stringi::stri_encode(x, to = "UTF-8")
    )))

After this the error disappeared.

@shumas

shumas commented Feb 16, 2023

Any workaround? I'm using the appender to insert Cyrillic strings, and it throws the exception "Invalid unicode (byte sequence mismatch) detected in segment statistics update".

@Mytherin
Collaborator

The data has to be valid UTF8 when inserted into the system. If you are inserting data that is not valid UTF8 then you should convert it prior to inserting.
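
As a hedged illustration in R (the poster is on .NET, so this is only an analogy, and the CP1251 source encoding is an assumption), Cyrillic text held in a single-byte encoding can be converted to UTF-8 before it is inserted:

x <- "\xcf\xf0\xe8\xe2\xe5\xf2"                   # "Привет" stored as CP1251 bytes
x_utf8 <- iconv(x, from = "CP1251", to = "UTF-8")
validUTF8(x_utf8)                                 # TRUE: safe to insert into DuckDB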

@shumas

shumas commented Feb 16, 2023

I'm sorry, I found an ugly workaround for the .NET DuckDB driver (the issue was there):

call "duckdb_append_varchar" with a byte array argument instead of the .NET string

[DllImport(DuckDbLibrary, CallingConvention = CallingConvention.Cdecl, EntryPoint = "duckdb_append_varchar")]
public static extern DuckDBState DuckDBAppendVarchar(DuckDBAppender appender, byte[] val);

// pass the UTF-8 encoded bytes instead of the string:
Encoding.UTF8.GetBytes(someString)

And the exception goes away.

Sorry for disturbing :)

@bardware

bardware commented Apr 22, 2023

containing various unicode characters, is correctly encoded in utf-8

I was just playing with the command line on a Windows box and got said exception with a file that is Win-1252 encoded. I understand the issue.
It came as a surprise that a Windows program prefers UTF-8.

I exported a CSV file from online banking.
The header line looks like this:

Schlusstag / Zahltag;Valuta;Depot-Nr.;Konto-Nr.;Umsatzart;Bemerkung;Nominal / Stück;Bezeichnung;WKN;ISIN;Handelswährung;Kurs;Devisenkurs;Stückzinsen;Kapitalertragssteuer inkl. Solidaritätszuschlag;Ausländische Quellensteuer;Vergütete Steuern;Transaktionspreis inkl. sonstiger Kosten;Eigene Spesen;Fremde Spesen;Ausmachender Betrag
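
A hedged sketch (the file name, delimiter and table name are assumptions based on the header shown): read the Windows-1252 file in R with an explicit source encoding so the strings arrive as UTF-8 before they reach DuckDB.

library(readr)
library(DBI)

umsaetze <- read_delim(
  "umsaetze.csv", delim = ";",
  locale = locale(encoding = "windows-1252")    # re-encoded to UTF-8 on read
)
con <- dbConnect(duckdb::duckdb(), dbdir = "bank.duckdb")
dbWriteTable(con, "umsaetze", umsaetze, overwrite = TRUE)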

@wrb2

wrb2 commented Jun 26, 2023

I am also seeing this error when running queries.

The behavior is pretty strange:

  • I can CTAS a table in DuckDB from Parquet and that works, but when I try to create another table using that data it fails.
  • I traced it to a specific column.
  • It is a technical column that is just numbers and letters, and the values should be generated by the same algorithm - not free text from users or anything.
  • If I wrap the column in md5(), it still fails.
  • If I wrap the column in hash(), it works.
  • I can "sanitize" the data by using DuckDB to export the data to CSV and reimport it. Then everything is fine. But it's 27 GB, so it's not a very good experience.
  • Other tools (I tested Spark, Trino, ClickHouse) don't see any issue with the column.
  • It is a Parquet file created by Spark.

As for the query, I'm doing something like

with
test_data as (
    select md5('1') as id, null as parent_id union all
    select md5('2') as id, md5('1') as parent_id union all
    select md5('3') as id, md5('1') as parent_id union all
    select md5('4') as id, md5('3') as parent_id
),

test as (

with recursive parents (id, parent_id, path) as 
(
    select id, parent_id, [id] as path
    from test_data
    where parent_id is null

    union all

    select t.id, t.parent_id, p.path || [t.id] as path
    from test_data t
    join parents p on t.parent_id = p.id

)
select * from parents

)

select * from test

I thought it might be something with the arrays, because other operations with the column seem fine. But I'm not sure.

Any idea what I can investigate to make things work?

Thanks.

@Mytherin
Collaborator

Could you share a parquet file containing the problematic column with us? If you can’t publish it publicly feel free to send me an email with the data.

@wrb2

wrb2 commented Jun 26, 2023

Sadly I can't - it is confidential.

What kind of digging can I do on my end that would help?

I'll at least try to chase a few versions of the data (we should have a Spark-written copy, a Trino-written copy and a Delta Lake copy that might use a different writer) to find out whether this is a Parquet issue or a data issue. The CSV export fixing it is pretty suspicious, I feel.

Also, I just realized this was on 0.7 and on 0.8.0, but today I updated to 0.8.1. I'll check if I can reproduce it on 0.8.1.

@lnkuiper lnkuiper assigned lnkuiper and unassigned hannes Jun 26, 2023
@lnkuiper
Contributor

Hi @wrb2, would it be possible for you to try this using the branch in this PR: #7931?
I have fixed a bug concerning LISTs of VARCHAR there; we suspect this is the issue.

@wrb2

wrb2 commented Jun 27, 2023

I have retested it on 0.8.1. I don't know how I got it working last time; this time it never works, no matter whether I go through CSV or Parquet or anything else.

@lnkuiper I'll go try it.

@wrb2

wrb2 commented Jun 27, 2023

@lnkuiper it fixed it!

@lnkuiper
Contributor

@wrb2 Thank you for taking the time to test! Glad this is fixed

@wrb2

wrb2 commented Jun 27, 2023

Any idea when this might end up in a new release?

@lnkuiper
Contributor

lnkuiper commented Jul 5, 2023

@wrb2 If I recall correctly, our next release is scheduled for September.

Closed via #7931

@lnkuiper lnkuiper closed this as completed Jul 5, 2023
@ajdamico

Hi, I'm still hitting this error on the duckdb_0.8.1-9000 dev version. Should I open a new issue, or should this thread be re-opened?

# using github
remotes::install_github("duckdb/duckdb-r")


# minimal reproducible example
library(duckdb)
con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
dbWriteTable( con , 'my_table' , my_df )

console output:

> library(duckdb)
Loading required package: DBI
> con <- dbConnect( duckdb::duckdb() , dbdir = 'my-db.duckdb' )
> my_df <- structure(list(no_municipio_esc = "Est\xe2ncia", no_municipio_prova = "Est\xe2ncia"), row.names = 16L, class = "data.frame")
> dbWriteTable( con , 'my_table' , my_df )
Error: rapi_execute: Failed to run query
Error: Invalid Input Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update
In addition: Warning message:
Database is garbage-collected, use dbDisconnect(con, shutdown=TRUE) or duckdb::duckdb_shutdown(drv) to avoid this. 
> 
> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] duckdb_0.8.1-9000 DBI_1.1.3        

loaded via a namespace (and not attached):
[1] compiler_4.3.1 tools_4.3.1   
> 
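
A hedged workaround sketch for the repro above (it sidesteps rather than fixes the underlying verification gap): convert the latin1 strings to UTF-8 before calling dbWriteTable.

my_df <- structure(list(no_municipio_esc = "Est\xe2ncia",
                        no_municipio_prova = "Est\xe2ncia"),
                   row.names = 16L, class = "data.frame")
my_df[] <- lapply(my_df, iconv, from = "latin1", to = "UTF-8")  # values become "Estância"
dbWriteTable(con, "my_table", my_df)   # should now succeed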

@Mytherin
Collaborator

Thanks for reporting - could you open a new issue in the new duckdb-r repo?
