INTERNAL Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update #7263

Closed
2 tasks done
manticore-projects opened this issue Apr 26, 2023 · 6 comments · Fixed by #7414

Comments

@manticore-projects

What happens?

Appender10MRows	1	2.634649
Appender10MRows	2	2.635860
Appender10MRows	3	2.641729
Appender10MRows	4	2.636160
Appender10MRows	5	2.630352
Appender10MRowsPrimaryKey	1	8.375196
Appender10MRowsPrimaryKey	2	terminate called after throwing an instance of 'duckdb::InternalException'
  what():  INTERNAL Error: Invalid unicode (byte sequence mismatch) detected in segment statistics update

To Reproduce

Normal Performance Test, called by

build/release/benchmark/benchmark_runner

OS:

Linux archlinux 6.2.12-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 20 Apr 2023 16:11:55 +0000 x86_64 GNU/Linux

DuckDB Version:

Latest Git

DuckDB Client:

Shell

Full Name:

Andreas Reichel

Affiliation:

manticore-projects Co. Ltd.

Have you tried this on the latest master branch?

  • I agree

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • I agree
@manticore-projects
Author

This problem is not reliably reproducible. I ran the test suite 10 times and it occurred only once.
So unless someone has an idea what triggers it, we can close the issue, since I won't be able to trace it to the source.

@Mytherin
Collaborator

It looks to me like a problem in the benchmark itself: the data generator randomly generates invalid UTF-8 and inserts it into the database through the C++ API, which bypasses UTF-8 verification until the data reaches the storage.
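
To illustrate why randomly generated bytes can trip a check like this, here is a minimal sketch of a structural UTF-8 validator. `IsValidUtf8` is a hypothetical helper written for this example; it is not DuckDB's verification code, and DuckDB's actual checks may differ in detail.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not DuckDB's implementation): returns true if `s`
// is well-formed UTF-8. Rejects stray continuation bytes, truncated
// sequences, overlong encodings, surrogates, and code points > U+10FFFF.
bool IsValidUtf8(const std::string &s) {
    // Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
    static const uint32_t min_code_point[] = {0x0, 0x80, 0x800, 0x10000};
    size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        size_t extra;        // number of continuation bytes expected
        uint32_t code_point; // decoded value, built up below
        if (lead < 0x80) {
            extra = 0;
            code_point = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            extra = 1;
            code_point = lead & 0x1F;
        } else if ((lead & 0xF0) == 0xE0) {
            extra = 2;
            code_point = lead & 0x0F;
        } else if ((lead & 0xF8) == 0xF0) {
            extra = 3;
            code_point = lead & 0x07;
        } else {
            return false; // stray continuation byte or invalid lead byte
        }
        if (extra > s.size() - i - 1) {
            return false; // sequence is cut off at the end of the input
        }
        for (size_t k = 1; k <= extra; k++) {
            const unsigned char cont = static_cast<unsigned char>(s[i + k]);
            if ((cont & 0xC0) != 0x80) {
                return false; // lead byte not followed by a continuation byte
            }
            code_point = (code_point << 6) | (cont & 0x3F);
        }
        if (code_point < min_code_point[extra]) {
            return false; // overlong encoding
        }
        if (code_point >= 0xD800 && code_point <= 0xDFFF) {
            return false; // UTF-16 surrogate range is not valid in UTF-8
        }
        if (code_point > 0x10FFFF) {
            return false; // beyond the Unicode range
        }
        i += extra + 1;
    }
    return true;
}
```

A buffer of random or uninitialized bytes only passes such a check by chance, since any byte at or above 0x80 must be part of a correctly formed multi-byte sequence. That would be consistent with a failure that shows up in only some runs.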

@manticore-projects
Author

Yes, that was my assumption too and so I reported this issue.
However, I can look into this only much later, after fixing some JDBC stuff. We can keep it as a reminder. Can we assign priorities?

@lnkuiper
Contributor

lnkuiper commented Apr 26, 2023

I have a hunch about what it could be.

We could try running the benchmark with the build flags DESTROY_UNPINNED_BLOCKS=1 and DISABLE_STRING_INLINE=1 and see if we can reproduce.

Edit: cannot reproduce even with these flags

@rsund
Contributor

rsund commented Apr 26, 2023

Reproducible parquet in #5882 (although not benchmark related).

@Tishj
Contributor

Tishj commented Apr 26, 2023

I'm gonna take a guess at the issue, from looking into it a bit.
TL;DR: the pointer is saved directly and it outlives the scope of the buffer it points to.

We create a stack-allocated char area_code[6]; and then create a string_t from this memory.
If DUCKDB_DEBUG_NO_INLINE is set, the string_t constructor stores this pointer directly.
Then later at:

string_t StringVector::AddStringOrBlob(Vector &vector, string_t data) {
	D_ASSERT(vector.GetType().InternalType() == PhysicalType::VARCHAR);
	if (data.IsInlined()) {
		// string will be inlined: no need to store in string heap
		return data;
	}
	if (!vector.auxiliary) {
		vector.auxiliary = make_buffer<VectorStringBuffer>();
	}
	D_ASSERT(vector.auxiliary->GetBufferType() == VectorBufferType::STRING_BUFFER);
	auto &string_buffer = (VectorStringBuffer &)*vector.auxiliary;
	return string_buffer.AddBlob(data);
}

The data behind this pointer is also never copied into the string heap, because the string is supposed to be inlined, so we store the stack pointer in the Vector.

But if DUCKDB_DEBUG_NO_INLINE is not set, then I don't see how this can cause a problem

As a side note, shouldn't DUCKDB_DEBUG_NO_INLINE also make IsInlined always return false?
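
The hazard described above can be sketched with a simplified stand-in type. Everything here is an assumption made for illustration: `TinyStringRef`, `kInlineLength`, the `debug_no_inline` parameter, and `AddToHeap` are hypothetical names that model the behaviour as this comment describes it, not DuckDB's actual `string_t` or `StringVector` code.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Simplified, hypothetical stand-in for a string_t-style value.
struct TinyStringRef {
    static constexpr uint32_t kInlineLength = 12; // assumed inline threshold

    uint32_t length;
    char inlined[kInlineLength]; // owns a copy of short strings
    const char *pointer;         // borrowed, NOT owned; nullptr when inlined

    // debug_no_inline models a debug switch that skips the inline copy.
    TinyStringRef(const char *data, uint32_t len, bool debug_no_inline = false)
        : length(len), inlined{}, pointer(nullptr) {
        if (len <= kInlineLength && !debug_no_inline) {
            std::memcpy(inlined, data, len); // short string: copy the bytes
        } else {
            pointer = data; // otherwise keep the caller's pointer
        }
    }

    // Length-only check: it does not know whether the bytes were copied.
    bool IsInlined() const { return length <= kInlineLength; }

    const char *GetData() const { return pointer ? pointer : inlined; }
};

// Mirrors the shape of the function quoted above: values that report
// themselves as inlined are returned unchanged, everything else is copied
// into storage owned by `heap`.
TinyStringRef AddToHeap(std::string &heap, const TinyStringRef &value) {
    if (value.IsInlined()) {
        return value; // no copy; a borrowed pointer survives here
    }
    heap.assign(value.GetData(), value.length);
    return TinyStringRef(heap.data(), value.length);
}
```

In this model, a 5-byte value built with `debug_no_inline = true` still reports `IsInlined()`, so `AddToHeap` hands it back with the caller's pointer intact. If that pointer refers to a stack buffer such as `area_code`, reading it after the enclosing function returns is undefined behaviour. Making the inlined check aware of the debug switch, as the side note suggests, would route such values through the copying branch.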

carlopi added a commit to carlopi/duckdb that referenced this issue May 8, 2023
Possible fix to duckdb#7263 (area_code[0] might be uninitialized)
@carlopi carlopi mentioned this issue May 8, 2023
carlopi added a commit to carlopi/duckdb that referenced this issue May 9, 2023
Fix to duckdb#7263 (area_code[0] might be uninitialized otherwise)
Mytherin added a commit that referenced this issue May 10, 2023
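
The commit messages above point at `area_code[0]` possibly being left uninitialized. The benchmark's generator code is not shown in this thread, so the following is only a sketch of the general remedy under that assumption: value-initialize the buffer so that no byte is ever read before it is written. `MakeAreaCode` and its five-digit layout are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <random>

// Hypothetical generator, not the benchmark's real code: produces five
// random decimal digits followed by a terminating '\0'.
std::array<char, 6> MakeAreaCode(std::mt19937 &rng) {
    // Value-initialization sets every byte to '\0', so a position that the
    // loop below skipped would hold a defined value instead of stack garbage.
    std::array<char, 6> area_code{};
    std::uniform_int_distribution<int> digit(0, 9);
    for (std::size_t i = 0; i < 5; i++) {
        area_code[i] = static_cast<char>('0' + digit(rng));
    }
    return area_code; // index 5 stays '\0'
}
```

Returning the buffer by value, rather than returning a pointer to a local array, also avoids the lifetime problem raised earlier in the thread.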