
FSST string compression failed due to incorrect size calculation #5675

Closed
RXminuS opened this issue Dec 13, 2022 · 39 comments · Fixed by #6001

RXminuS commented Dec 13, 2022

What happens?

When trying to create a table like this

CREATE TABLE xxx AS SELECT tbl.*, '12345' AS dedup_group
                FROM read_parquet('path/glob/*.snappy.parquet') AS tbl;

I get the following error after a few dozen seconds

InternalException: INTERNAL Error: FSST string compression failed due to incorrect size calculation

To Reproduce

I'm guessing it's somehow dependent on the parquet file that I'm trying to load in, but sadly I can't share the data due to privacy reasons. I'm happy to try and generate artificial data that exhibits the same problem, but I need some help thinking of ideas about what the issue might be so that I don't waste time trying every possible combination.

The parquet files are about 80MB each and were generated from Spark (Scala).

Unfortunately that's the only extra information I have available at the moment but again I'm happy to continue digging.

OS:

macOS

DuckDB Version:

0.6.1, 0.6.2dev447

DuckDB Client:

Python

Full Name:

Rik Nauta

Affiliation:

LMU AB

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@RXminuS RXminuS added the bug label Dec 13, 2022

RXminuS commented Dec 13, 2022

update: I just saw there's an enable_fsst_vectors setting, so I tried changing that to True, but it hasn't solved the issue.
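For reference, toggling that setting is an ordinary configuration statement; assuming it is exposed like other DuckDB options, it looks roughly like this:

SET enable_fsst_vectors = true;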


RXminuS commented Dec 13, 2022

update: maybe someone more knowledgeable can think of something in the details of #4366?


hannes commented Dec 13, 2022

Thanks for the report, but can you please try to make it reproducible by creating a dataset you can share?

@samansmink samansmink self-assigned this Dec 13, 2022
@samansmink
Contributor

@RXminuS This is very hard to figure out without any idea of the data. Given that DuckDB compresses the data per row group and per column, there should be (at least) one offending column here. Maybe this specific column is not privacy sensitive? If it is, some statistics would already be helpful: min length, max length, nulls, etc.
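A sketch of how such per-column statistics could be gathered (some_string_col is a placeholder for whichever column is under suspicion):

-- placeholder column name; length() counts characters, count(col) skips NULLs
SELECT min(length(some_string_col)) AS min_len,
       max(length(some_string_col)) AS max_len,
       count(*) - count(some_string_col) AS null_count
FROM read_parquet('path/glob/*.snappy.parquet');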


RXminuS commented Dec 13, 2022

Yeah, I'm happy to try and create a dataset. However, I have about 80 columns in there with a bunch of mixed and nested fields, so I was trying to see if there were some initial hunches as to what the problem might be, so I can try those columns first; just randomly sampling the data, I've not been able to reproduce it.

It would also be really helpful if there were some way of increasing the logging so that I can figure out what the offending row (or even file) is.


RXminuS commented Dec 13, 2022

@samansmink if I'd loop through every column in the dataset and exclude each one in turn should that be able to isolate the issue then?


RXminuS commented Dec 13, 2022

Also a hunch, I don't know if it matters: some of the data is from the internet and so probably messy. What would happen if a string contains invalid UTF-8 code pairs? Could that upset the count somehow?


RXminuS commented Dec 13, 2022

Thanks for the report, but can you please try to make it reproducible by creating a dataset you can share?

Absolutely! I felt really bad opening the issue with so little information, but I also hoped that having at least the error message up here might bring in other people who are unknowingly experiencing the same thing and Googling for it. The message first presented itself through SQLAlchemy / ibis, and I've had issues with parquet in the past as well, so since the error message is very generic it took me a while to trace it back to DuckDB rather than one of the other components involved.

But I'll do my best to isolate the data point as per the suggestions in this thread, and I really appreciate any help with homing in on the issue and the patience to get there.

P.S. Also, I'd be remiss if I didn't at least give a massive shoutout to DuckDB...It's quacking awesome! <3

@arjenpdevries
Contributor

DuckDB...It's quacking awesome!

Like that :-)

@samansmink
Contributor

@RXminuS

@samansmink if I'd loop through every column in the dataset and exclude each one in turn should that be able to isolate the issue then?

I would go the other way around, selecting a single column for every string column in your dataset. Then if it turns out to be one of the nested fields you can do a similar trick with struct_extract where you select only part of the nested type.
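A sketch of that isolation approach, with placeholder column and field names:

-- test one string column at a time (some_string_col is a placeholder)
CREATE TABLE col_test AS SELECT tbl.some_string_col
                FROM read_parquet('path/glob/*.snappy.parquet') AS tbl;

-- for a nested column, pull out only one part of the struct
CREATE TABLE field_test AS SELECT struct_extract(tbl.some_struct_col, 'some_field') AS f
                FROM read_parquet('path/glob/*.snappy.parquet') AS tbl;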

Also a hunch, I don't know if it matters: some of the data is from the internet and so probably messy. What would happen if a string contains invalid UTF-8 code pairs? Could that upset the count somehow?

This should work I think; FSST also works on BLOB types.


RXminuS commented Jan 4, 2023

Just FYI, I'm still investigating. Just been a bit busy around the holidays.

@samansmink
Contributor

Hi @RXminuS, have you made any progress with the reproduction of this? I would really like to have this one fixed before our next release 😁

If there's any way I can help, let me know!


RXminuS commented Jan 16, 2023

I'm still hunting it down. I've isolated it to a file and a column...but it seems that the row that causes it keeps moving. However...I have a suspicion that #5824 might be related. I'm testing the latest dev branch now to confirm.


RXminuS commented Jan 16, 2023

Nope...the issue still remains. What's weird is that sometimes I get a full-blown error with a stack trace and sometimes nothing more than

[screenshot: CleanShot 2023-01-17 at 00 52 03@2x]

What's weird is that the file loads just fine in Tad Viewer as well. I'm going to keep trying to find a row that by itself reproduces the error, but it's slow going 😢


RXminuS commented Jan 16, 2023

[screenshot: CleanShot 2023-01-17 at 00 55 26@2x]


RXminuS commented Jan 17, 2023

Ok, I think I figured out why the rows keep changing...it writes something to the .wal file that then subsequently breaks all following queries. Is there any way that file can be of use?

@Mytherin
Collaborator

What's weird is that the file loads just fine in Tad Viewer as well. I'm going to keep trying to find a row that by itself reproduces the error, but it's slow going

It's likely not a row, but an individual column that contains a combination of values that FSST does not handle correctly.

Ok, I think I figured out why the rows keep changing...it writes something to the .wal file that then subsequently breaks all following queries. Is there any way that file can be of use?

That could be, but the WAL file also contains the actual data. The actual data you are loading would be more helpful. You could also send it to us by e-mail - there's no need to publish it publicly.

Alternatively, you can try using a scrambling tool such as Faker to scramble the Parquet files and check whether the problem persists?

I've written a script for scrambling Parquet files - and another user wrote this scrambling tool. Perhaps those could be helpful as well.


RXminuS commented Jan 17, 2023

@Mytherin yeah, that seems likely now. When I run with a LIMIT X on the rows to be inserted, then if I go over 122182 I get the error. But if I only select that row by itself, it works just fine.
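In other words, roughly this, with the cutoff being the number above:

CREATE TABLE xxx AS SELECT tbl.*, '12345' AS dedup_group
                FROM read_parquet('path/glob/*.snappy.parquet') AS tbl
                LIMIT 122182;  -- works; any limit above this fails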

The problem I have is that the data itself is not the only thing that's confidential; it's also the structure of the data and what data this company I'm working for has available.

I'll see if I can scramble some of the column names and values and still keep the error occurring. Alternatively, I'm looking at whether it wouldn't be easier to just add some logging in DuckDB and output the problematic data to a crash dump.

@Mytherin
Collaborator

Have you tried isolating which column is causing the issue? DuckDB stores data in columnar format, and columns are compressed individually. It is likely you will be able to replicate this issue by only selecting the individual column, e.g.:

CREATE TABLE xxx AS SELECT tbl.col1
                FROM read_parquet('path/glob/*.snappy.parquet') AS tbl;

If that is the case perhaps you could share the individual column with us? It is possible the individual column does not by itself represent confidential data.


RXminuS commented Jan 17, 2023

Yeah I know which top level column it is, however it's a nested struct so there's a bunch of different sized arrays and stuff in there. I can try selecting only sub-columns and seeing if I can narrow it down further.

@Mytherin
Collaborator

Ah, I see. Unnesting the struct might change the storage layout and result in the bug not occurring - but removing parts of the struct should not change the storage layout of the other parts. For example, if your column definition looks like this:

STRUCT(s VARCHAR[], i INT)[];

The s and i columns are stored separately, so turning it into this should not affect storage:

STRUCT(s VARCHAR[])[];


RXminuS commented Jan 18, 2023

I've gotten a hard NO to share the data that's problematic; and have been unsuccessful in replicating the issue with the sensitive fields removed / altered.

I think it's time for a different tactic...

I'm going to try to build and write a test locally, like was done for #5824. Then I'll post debug information & stack traces. If someone's available, I'm happy to do a remote pair-programming session where we try to fix the issue together. Hopefully we can find the underlying root cause that way without needing to share the data itself.

@samansmink
Contributor

I've gotten a hard NO to share the data that's problematic; and have been unsuccessful in replicating the issue with the sensitive fields removed / altered.

@RXminuS that's understandable. I'll do some digging into the code tomorrow and also see if I can reproduce the error by brute-forcing a bunch of differently distributed random data through.

If that fails, I would certainly be down for a remote pair-programming session; that'll be super helpful for sure!

@samansmink
Contributor

@RXminuS I haven't managed to reproduce this with a bunch of random data so I would propose the following:

I made a branch at https://github.com/samansmink/duckdb/tree/instrumented-fsst-compression where I added a bunch of print statements and some extra checks on the relevant variables. Could you rerun your query on the offending column and send me the output? If you have any questions, feel free to also reach out through the DuckDB Discord or to me directly: 'Sam Ansmink#3611'

If that still fails, I think a pair-debugging session would be our best bet to catch this.


RXminuS commented Jan 20, 2023

Awesome @samansmink! I was literally close to crying this week because I just kept going around in circles and had so many things going on, and I feel super bad that I haven't been able to put up a more concrete issue description yet or some rudimentary PR...and then you just made my day 🙇‍♂️

Sidenote...I'm only desperate to get it working because I absolutely love DuckDB and now my whole team is super excited to start using all the data pipelines and scripts I have made flowing in and out of it.

Anyways, thanks again for adding the instrumentation. I'm going to give this a shot later this weekend. I'll keep you posted

@samansmink
Contributor

@RXminuS ah, no worries! The log output should give a pretty good idea of what's going wrong; then a fix should not be too difficult, I think.


RXminuS commented Jan 23, 2023

I've built DuckDB from source using BUILD_PYTHON=1 make debug and if I do a pip uninstall duckdb I indeed see
[screenshot: CleanShot 2023-01-23 at 15 49 10@2x]

But when running from python I'm not seeing any additional logs. Do I need to add a make flag or is there a dump saved somewhere?
[screenshot: CleanShot 2023-01-23 at 15 45 51@2x]

@Mytherin
Collaborator

Perhaps try running pip uninstall multiple times until it returns WARNING: Skipping duckdb as it is not installed. and then building from source? pip tends to keep multiple versions of the package around, and it can lead to the wrong version being used by accident.

@samansmink
Contributor

@RXminuS ah could you run it with the duckdb cli instead? that would be:

./build/debug/duckdb <some path to where the db will be created>


RXminuS commented Jan 23, 2023

Oh derp! I cloned your repo but forgot to switch branches 🤦 Will rebuild now


RXminuS commented Jan 23, 2023

Geez...the logs are already > 1GB and it's still going (that's just logging "HasEnoughSpace" etc.). How can I best send this?

@samansmink
Contributor

@RXminuS ah sorry, my bad, I should have mentioned this: I'm only interested in the tail of the log, right before the crash. During FSST compression we need to "fill up" memory blocks. To do this, we repeatedly call the HasEnoughSpace method to confirm that the data we want to add still fits. During Finalize, however, the code somehow arrives at a different total size than during compression.

So if you could send me the last 1000 lines or so? What I need are the last few HasEnoughSpace and Finalize logs right before the crash. I'm hoping that there's going to be a discrepancy between those two that shows which variable is corrupted/wrong.


RXminuS commented Jan 23, 2023

[The instrumented debug output was interleaved across multiple threads and is largely unreadable; representative de-interleaved lines:]

HasEnoughSpace (len: 0, max_len: 4, req_width: 3, offset_size: 14964, seg_count: 39878, space_calc: 16+2+14964+0+22=15004)
HasEnoughSpace (len: 0, max_len: 15, req_width: 0, offset_size: 0, seg_count: 28477, space_calc: 16+0+0+0+32=48)
HasEnoughSpace (len: 0, max_len: 20, req_width: 0, offset_size: 0, seg_count: 87001, space_calc: 16+0+0+0+33=49)
...
SUMMARY: AddressSanitizer: heap-buffer-overflow bitpacking.cpp:76 in std::__1::enable_if<(((unsigned short)6) + ((unsigned short)30)) >= ((unsigned char)32), void>::type duckdb_fastpforlib::internal::pack_single_in<unsigned int, (unsigned short)6, (unsigned short)30, 63u, (unsigned char)32>(unsigned int, unsigned int*&)
Shadow bytes around the buggy address:
  0x10005d2f52b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005d2f52c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005d2f52d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005d2f52e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005d2f52f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x10005d2f5300:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10005d2f5310: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10005d2f5320: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10005d2f5330: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10005d2f5340: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10005d2f5350: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
[more interleaved HasEnoughSpace output; representative lines:]

HasEnoughSpace (len: 0, max_len: 4, req_width: 3, offset_size: 14964, seg_count: 39887, space_calc: 16+2+14964+0+22=15004)
HasEnoughSpace (len: 0, max_len: 15, req_width: 0, offset_size: 0, seg_count: 28486, space_calc: 16+0+0+0+32=48)
HasEnoughSpace (len: 0, max_len: 20, req_width: 0, offset_size: 0, seg_count: 87022, space_calc: 16+0+0+0+33=49)
...
==81109==ABORTING
[1]    81109 abort      ./duckdb ../../../test.duckdb


RXminuS commented Jan 23, 2023

Is that helpful at all? I've saved the entire log (7GB), so let me know if you need me to aggregate / filter / tail something else. At least a small win: I've finally been able to provide some actual debug information 🎉 😅

@samansmink
Contributor

@RXminuS sent you an email, let's move this discussion there to reduce the noise a bit :)


samansmink commented Jan 25, 2023

Okay, I'm pretty sure I found the bug: during FSST compression the maximum string length is tracked in the max_compressed_string_length variable, but this wasn't reset properly. This means that what can happen is:

  • A segment with a large max string size is created.
  • A new segment with many empty strings is started; the current width is reset to 0, but as long as the incoming strings are empty, the stale maximum is not reset.
  • Now a non-empty string shorter than that never-reset maximum is added, causing a recalculation of the current width, which is then still incorrect.

I have a fix at https://github.com/samansmink/duckdb/tree/fix-issue-5675 but I still(!) haven't managed to reproduce the bug. @RXminuS would you mind confirming that this fixes the issue? Then I will try to think of a way to reproduce the issue so I can write a test for it.
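For reference, data with roughly the shape described above can be generated directly in SQL. This is only a sketch of the value pattern from the bullet points (one long string, then a long run of empty strings, then a short non-empty string in the same column), not a confirmed reproduction, and whether FSST compression actually kicks in depends on the storage configuration:

-- sketch only: value pattern from the description above, not a confirmed repro
CREATE TABLE fsst_pattern AS
SELECT CASE
         WHEN i = 0 THEN repeat('x', 1000)   -- one long string: large max string size
         WHEN i < 60000 THEN ''              -- many empty strings, spilling into a new segment
         ELSE 'short'                        -- a short non-empty string after the empties
       END AS s
FROM range(70000) t(i);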


RXminuS commented Jan 25, 2023

Epic!!! Will do; and I'll see if I can get you some test data.


RXminuS commented Jan 25, 2023

🥇 Can confirm, that's working. It also completed a lot more quickly?! I'll try loading in the other data that was giving me the UTF-8 invalid asserts to see if that's fixed as well.


RXminuS commented Jan 26, 2023

Yes, the UTF-8 issue seems to have disappeared as well. Unfortunately it looks like the column that's problematic is the one containing somewhat sensitive information so I doubt I'm going to get sign-off for supplying it as test-data 😕
