
FSST string compression failed due to incorrect size calculation #5675

Closed
RXminuS opened this issue Dec 13, 2022 · 39 comments · Fixed by #6001

RXminuS commented Dec 13, 2022

What happens?

When trying to create a table like this

CREATE TABLE xxx AS SELECT tbl.*, '12345' AS dedup_group
                FROM read_parquet('path/glob/*.snappy.parquet') AS tbl;

I get the following error after a few dozen seconds

InternalException: INTERNAL Error: FSST string compression failed due to incorrect size calculation

To Reproduce

I'm guessing it's somehow dependent on the parquet file that I'm trying to load in, but sadly I can't share the data due to privacy reasons. I'm happy to try and generate artificial data that exhibits the same problem, but I need some help thinking of ideas about what the issue might be so that I don't waste time trying every possible combination.

The parquet files are about 80MB each and were generated from Spark (Scala).

Unfortunately that's the only extra information I have available at the moment but again I'm happy to continue digging.

OS:

macOS

DuckDB Version:

0.6.1, 0.6.2dev447

DuckDB Client:

Python

Full Name:

Rik Nauta

Affiliation:

LMU AB

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@RXminuS RXminuS added the bug label Dec 13, 2022

RXminuS commented Dec 13, 2022

update: I just saw there's an enable_fsst_vectors setting, so I tried changing that to True, but it hasn't solved the issue.
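For reference, toggling that setting is an ordinary configuration statement; assuming it is exposed like other DuckDB options, it looks roughly like this:

SET enable_fsst_vectors = true;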


RXminuS commented Dec 13, 2022

update: maybe someone more knowledgeable can think of something in the details of #4366?


hannes commented Dec 13, 2022

Thanks for the report, but can you please try to make it reproducible by creating a dataset you can share?

@samansmink samansmink self-assigned this Dec 13, 2022
@samansmink
Contributor

@RXminuS This is very hard to figure out without any idea of the data. Given that DuckDB compresses the data per row group and per column, there should be (at least) one offending column here. Maybe this specific column is not privacy sensitive? If it is, some statistics would already be helpful: min length, max length, nulls, etc.
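A sketch of how such per-column statistics could be gathered (some_string_col is a placeholder for whichever column is under suspicion):

-- placeholder column name; length() counts characters, count(col) skips NULLs
SELECT min(length(some_string_col)) AS min_len,
       max(length(some_string_col)) AS max_len,
       count(*) - count(some_string_col) AS null_count
FROM read_parquet('path/glob/*.snappy.parquet');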


RXminuS commented Dec 13, 2022

Yeah, I'm happy to try and create a dataset. However, I have about 80 columns in there with a bunch of mixed and nested fields, so I was trying to see if there were some initial hunches as to what the problem might be, so I can try those columns first; just randomly sampling the data, I've not been able to reproduce it.

It would also be really helpful if there were some way of increasing the logging so that I can figure out what the offending row (or even file) is.


RXminuS commented Dec 13, 2022

@samansmink if I'd loop through every column in the dataset and exclude each one in turn should that be able to isolate the issue then?


RXminuS commented Dec 13, 2022

Also a hunch, I don't know if it matters: some of the data is from the internet and so probably messy. What would happen if a string contains invalid UTF-8 code pairs? Could that upset the count somehow?


RXminuS commented Dec 13, 2022

Thanks for the report, but can you please try to make it reproducible by creating a dataset you can share?

Absolutely! I felt really bad opening the issue with so little information, but I also hoped that having at least the error message up here might bring in other people who are unknowingly experiencing the same thing and Googling for it. The message first presented itself through SQLAlchemy / ibis, and I've had issues with parquet in the past as well, so since the error message is very generic it took me a while to trace it back to DuckDB rather than one of the other components involved.

But I'll do my best to isolate the data point as per the suggestions in this thread, and I really appreciate any help with homing in on the issue and the patience to get there.

P.S. Also, I'd be remiss if I didn't at least give a massive shoutout to DuckDB...It's quacking awesome! <3

@arjenpdevries
Contributor

DuckDB...It's quacking awesome!

Like that :-)

@samansmink
Contributor

@RXminuS

@samansmink if I'd loop through every column in the dataset and exclude each one in turn should that be able to isolate the issue then?

I would go the other way around, selecting a single column for every string column in your dataset. Then if it turns out to be one of the nested fields you can do a similar trick with struct_extract where you select only part of the nested type.
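A sketch of that isolation approach, with placeholder column and field names:

-- test one string column at a time (some_string_col is a placeholder)
CREATE TABLE col_test AS SELECT tbl.some_string_col
                FROM read_parquet('path/glob/*.snappy.parquet') AS tbl;

-- for a nested column, pull out only one part of the struct
CREATE TABLE field_test AS SELECT struct_extract(tbl.some_struct_col, 'some_field') AS f
                FROM read_parquet('path/glob/*.snappy.parquet') AS tbl;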

Also a hunch, I don't know if it matters: some of the data is from the internet and so probably messy. What would happen if a string contains invalid UTF-8 code pairs? Could that upset the count somehow?

This should work I think; FSST also works on BLOB types.


RXminuS commented Jan 4, 2023

Just FYI, I'm still investigating. Just been a bit busy around the holidays.

@samansmink
Contributor

Hi @RXminuS, have you made any progress with the reproduction of this? I would really like to have this one fixed before our next release 😁

If there's any way I can help, let me know!


RXminuS commented Jan 16, 2023

I'm still hunting it down. I've isolated it to a file and a column...but it seems that the row that causes it keeps moving. However...I have a suspicion that #5824 might be related. I'm testing the latest dev branch now to confirm.


RXminuS commented Jan 16, 2023

Nope...the issue still remains. What's weird is that sometimes I get a full-blown error with a stack trace and sometimes nothing more than

[screenshot: CleanShot 2023-01-17 at 00 52 03@2x]

What's weird is that the file loads just fine in Tad Viewer as well. I'm going to keep trying to find a row that by itself reproduces the error, but it's slow going 😢


RXminuS commented Jan 16, 2023

[screenshot: CleanShot 2023-01-17 at 00 55 26@2x]


RXminuS commented Jan 17, 2023

Ok, I think I figured out why the rows keep changing...it writes something to the .wal file that then subsequently breaks all following queries. Is there any way that file can be of use?

@Mytherin
Collaborator

What's weird is that the file loads just fine in Tad Viewer as well. I'm going to keep trying to find a row that by itself reproduces the error, but it's slow going

It's likely not a row, but an individual column that contains a combination of values that FSST does not handle correctly.

Ok, I think I figured out why the rows keep changing...it writes something to the .wal file that then subsequently breaks all following queries. Is there any way that file can be of use?

That could be, but the WAL file also contains the actual data. The actual data you are loading would be more helpful. You could also send it to us by e-mail - there's no need to publish it publicly.

Alternatively, you can try using a scrambling tool such as Faker to scramble the Parquet files and check whether the problem persists?

I've written a script for scrambling Parquet files - and another user wrote this scrambling tool. Perhaps those could be helpful as well.


RXminuS commented Jan 17, 2023

@Mytherin yeah, that seems likely now. When I run with a LIMIT X on the rows to be inserted, then if I go over 122182 I get the error. But if I only select that row by itself, it works just fine.
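In other words, roughly this, with the cutoff being the number above:

CREATE TABLE xxx AS SELECT tbl.*, '12345' AS dedup_group
                FROM read_parquet('path/glob/*.snappy.parquet') AS tbl
                LIMIT 122182;  -- works; any limit above this fails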

The problem I have is that the data itself is not the only thing that's confidential; it's also the structure of the data and what data this company I'm working for has available.

I'll see if I can scramble some of the column names and values and still keep the error occurring. Alternatively, I'm looking at whether it wouldn't be easier to just add some logging in DuckDB and output the problematic data to a crash dump.

@Mytherin
Collaborator

Have you tried isolating which column is causing the issue? DuckDB stores data in columnar format, and columns are compressed individually. It is likely you will be able to replicate this issue by only selecting the individual column, e.g.:

CREATE TABLE xxx AS SELECT tbl.col1
                FROM read_parquet('path/glob/*.snappy.parquet') AS tbl;

If that is the case perhaps you could share the individual column with us? It is possible the individual column does not by itself represent confidential data.


RXminuS commented Jan 17, 2023

Yeah I know which top level column it is, however it's a nested struct so there's a bunch of different sized arrays and stuff in there. I can try selecting only sub-columns and seeing if I can narrow it down further.

@Mytherin
Collaborator

Ah, I see. Unnesting the struct might change the storage layout and result in the bug not occurring - but removing parts of the struct should not change the storage layout of the other parts. For example, if your column definition looks like this:

STRUCT(s VARCHAR[], i INT)[];

The s and i columns are stored separately, so turning it into this should not affect storage:

STRUCT(s VARCHAR[])[];


RXminuS commented Jan 18, 2023

I've gotten a hard NO to share the data that's problematic; and have been unsuccessful in replicating the issue with the sensitive fields removed / altered.

I think it's time for a different tactic...

I'm going to try to build and write a test locally, like was done for #5824. Then I'll post debug information & stack traces. If someone's available, I'm happy to do a remote pair-programming session where we try to fix the issue together. Hopefully we can find the underlying root cause that way without needing to share the data itself.

@samansmink
Contributor

I've gotten a hard NO to share the data that's problematic; and have been unsuccessful in replicating the issue with the sensitive fields removed / altered.

@RXminuS that's understandable. I'll do some digging into the code tomorrow and also see if I can reproduce the error by brute-forcing a bunch of differently distributed random data through.

If that fails, I would certainly be down for a remote pair-programming session; that'll be super helpful for sure!

@samansmink
Contributor

@RXminuS I haven't managed to reproduce this with a bunch of random data so I would propose the following:

I made a branch at https://github.com/samansmink/duckdb/tree/instrumented-fsst-compression where I added a bunch of print statements and some extra checks on the relevant variables. Could you rerun your query on the offending column and send me the output? If you have any questions, feel free to also reach out through the DuckDB Discord or to me directly: 'Sam Ansmink#3611'

If that still fails, I think a pair-debugging session would be our best bet to catch this.


RXminuS commented Jan 20, 2023

Awesome @samansmink! I was literally close to crying this week because I just kept going around in circles and had so many things going on, and I feel super bad that I haven't been able to put up a more concrete issue description yet or some rudimentary PR...and then you just made my day 🙇‍♂️

Sidenote...I'm only desperate to get it working because I absolutely love DuckDB and now my whole team is super excited to start using all the data pipelines and scripts I have made flowing in and out of it.

Anyways, thanks again for adding the instrumentation. I'm going to give this a shot later this weekend. I'll keep you posted

@samansmink
Contributor

@RXminuS ah, no worries! The log output should give a pretty good idea of what's going wrong; then a fix should not be too difficult, I think.


RXminuS commented Jan 23, 2023

I've built DuckDB from source using BUILD_PYTHON=1 make debug and if I do a pip uninstall duckdb I indeed see
[screenshot: CleanShot 2023-01-23 at 15 49 10@2x]

But when running from python I'm not seeing any additional logs. Do I need to add a make flag or is there a dump saved somewhere?
[screenshot: CleanShot 2023-01-23 at 15 45 51@2x]

@Mytherin
Collaborator

Perhaps try running pip uninstall multiple times until it returns WARNING: Skipping duckdb as it is not installed. and then building from source? pip tends to keep multiple versions of the package around, and it can lead to the wrong version being used by accident.

@samansmink
Contributor

@RXminuS ah could you run it with the duckdb cli instead? that would be:

./build/debug/duckdb <some path to where the db will be created>


RXminuS commented Jan 23, 2023

Oh derp! I cloned your repo but forgot to switch branches 🤦 Will rebuild now


RXminuS commented Jan 23, 2023

Geez...the logs are already > 1GB and it's still going (that's just logging "HasEnoughSpace" etc.). How can I best send this?

@samansmink
Contributor

@RXminuS ah sorry, my bad, I should have mentioned this: I'm only interested in the tail of the log, right before the crash. During FSST compression we need to "fill up" memory blocks. To do this, we repeatedly call the HasEnoughSpace method to confirm that the data we want to add still fits. During Finalize, however, the code somehow arrives at a different total size than during compression.

So if you could send me the last 1000 lines or so? What I need are the last few HasEnoughSpace and Finalize logs right before the crash. I'm hoping that there's going to be a discrepancy between those two that shows which variable is corrupted/wrong.


RXminuS commented Jan 23, 2023

[The instrumented debug output was interleaved across multiple threads and is largely unreadable; representative de-interleaved lines:]

HasEnoughSpace (len: 0, max_len: 4, req_width: 3, offset_size: 14964, seg_count: 39878, space_calc: 16+2+14964+0+22=15004)
HasEnoughSpace (len: 0, max_len: 15, req_width: 0, offset_size: 0, seg_count: 28477, space_calc: 16+0+0+0+32=48)
HasEnoughSpace (len: 0, max_len: 20, req_width: 0, offset_size: 0, seg_count: 87001, space_calc: 16+0+0+0+33=49)
...
SUMMARY: AddressSanitizer: heap-buffer-overflow bitpacking.cpp:76 in std::__1::enable_if<(((unsigned short)6) + ((unsigned short)30)) >= ((unsigned char)32), void>::type duckdb_fastpforlib::internal::pack_single_in<unsigned int, (unsigned short)6, (unsigned short)30, 63u, (unsigned char)32>(unsigned int, unsigned int*&)
Shadow bytes around the buggy address:
  0x10005d2f52b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005d2f52c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005d2f52d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005d2f52e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10005d2f52f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x10005d2f5300:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10005d2f5310: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10005d2f5320: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10005d2f5330: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10005d2f5340: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x10005d2f5350: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
[more interleaved HasEnoughSpace output; representative lines:]

HasEnoughSpace (len: 0, max_len: 4, req_width: 3, offset_size: 14964, seg_count: 39887, space_calc: 16+2+14964+0+22=15004)
HasEnoughSpace (len: 0, max_len: 15, req_width: 0, offset_size: 0, seg_count: 28486, space_calc: 16+0+0+0+32=48)
HasEnoughSpace (len: 0, max_len: 20, req_width: 0, offset_size: 0, seg_count: 87022, space_calc: 16+0+0+0+33=49)
...
==81109==ABORTING
[1]    81109 abort      ./duckdb ../../../test.duckdb


RXminuS commented Jan 23, 2023

Is that helpful at all? I've saved the entire log (7GB), so let me know if you need me to aggregate / filter / tail something else. At least a small win: I've finally been able to provide some actual debug information 🎉 😅

@samansmink
Contributor

@RXminuS sent you an email, let's move this discussion there to reduce the noise a bit :)


samansmink commented Jan 25, 2023

Okay, I'm pretty sure I found the bug: during FSST compression the maximum string length is tracked in the max_compressed_string_length variable, but this wasn't reset properly. This means that what can happen is:

  • A segment with a large max string size is created.
  • A new segment with many empty strings is started; the current width is reset to 0, but as long as the incoming strings are empty, the stale maximum is not reset.
  • Now a non-empty string shorter than that never-reset maximum is added, causing a recalculation of the current width, which is then still incorrect.

I have a fix at https://github.com/samansmink/duckdb/tree/fix-issue-5675 but I still(!) haven't managed to reproduce the bug. @RXminuS would you mind confirming that this fixes the issue? Then I will try to think of a way to reproduce the issue so I can write a test for it.
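For reference, data with roughly the shape described above can be generated directly in SQL. This is only a sketch of the value pattern from the bullet points (one long string, then a long run of empty strings, then a short non-empty string in the same column), not a confirmed reproduction, and whether FSST compression actually kicks in depends on the storage configuration:

-- sketch only: value pattern from the description above, not a confirmed repro
CREATE TABLE fsst_pattern AS
SELECT CASE
         WHEN i = 0 THEN repeat('x', 1000)   -- one long string: large max string size
         WHEN i < 60000 THEN ''              -- many empty strings, spilling into a new segment
         ELSE 'short'                        -- a short non-empty string after the empties
       END AS s
FROM range(70000) t(i);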


RXminuS commented Jan 25, 2023

Epic!!! Will do; and I'll see if I can get you some test data.


RXminuS commented Jan 25, 2023

🥇 Can confirm, that's working. It also completed a lot more quickly?! I'll try loading in the other data that was giving me the UTF-8 invalid asserts to see if that's fixed as well.


RXminuS commented Jan 26, 2023

Yes, the UTF-8 issue seems to have disappeared as well. Unfortunately it looks like the column that's problematic is the one containing somewhat sensitive information so I doubt I'm going to get sign-off for supplying it as test-data 😕
