
INTERNAL Error: Attempted to access index N within vector of size N #10950

Closed
1 task done
davidcorcoran opened this issue Mar 1, 2024 · 9 comments · Fixed by Tmonster/duckdb#149 or #11596


@davidcorcoran

davidcorcoran commented Mar 1, 2024

What happens?

Running the given query causes INTERNAL Error: Attempted to access index 3 within vector of size 3, after which the database produces: 'Error: FATAL Error: Failed: database has been invalidated because of a previous fatal error. The database must be restarted prior to being used again.'

To Reproduce

Run the following query.

SELECT t1.Transaction_ID
FROM transactions t1
WHERE t1.Transaction_ID IN
    (SELECT t2.Referred_Transaction_ID
     FROM transactions t2
     WHERE t2.Transaction_ID IN (123606, 123602, 131522, 123604, 131470)
         AND t2.Transaction_ID NOT IN (SELECT t2_filter.Transaction_ID FROM transactions t2_filter))

That query is a very stripped-down version of the query that caused the error for me. So although it looks odd and pointless, it still produces the INTERNAL Error.
Use the following database:
bug.db.zip

Just to note something odd: If you export as CSV from that database, then reimport into a new database, the query will succeed.

OS:

x64, aarch64

DuckDB Version:

0.9.2, 0.10.0, nightly on 2024-03-01

DuckDB Client:

Java, CLI

Full Name:

David Corcoran

Affiliation:

Topaz (https://topaz.technology/)
We use DuckDB as an in memory pivoting database. We migrated from originally using MonetDB for the same job.

Have you tried this on the latest nightly build?

I have tested with a nightly build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have
@carlopi
Contributor

carlopi commented Mar 1, 2024

Can you edit the repro to:

SELECT t1.Transaction_ID
FROM transactions t1
WHERE t1.Transaction_ID IN
      (SELECT t2.Referred_Transaction_ID
       FROM transactions t2
       WHERE t2.Transaction_ID IN (123606, 123602, 131522, 123604, 131470)
         AND t2.Transaction_ID NOT IN (SELECT t2_filter.Transaction_ID FROM transactions t2_filter))

Thanks for the report

@sirfz

sirfz commented Mar 4, 2024

I get this error randomly when querying CSV files (using read_csv_auto). Apologies for the lack of reproduction code/data, but is it possible this happens when running aggregation queries (e.g. select count()) with read_csv_auto? Right now I consistently hit the error by running this query:

conn.query("select count() from read_csv_auto('/path/to/pairs*_data.csv.gz', header=true, filename = true)").show()

The pattern matches two files.

Although I can load the files without issue:

dp = conn.query("select * from read_csv_auto('/path/to/pairs*_data.csv.gz', header=true, filename = true)").arrow()

@davidcorcoran
Author

We rewrote our query to use an IN instead of the correlated subquery and it's been working fine for a few weeks.

@tboddyspargo

I ran into INTERNAL Error: Attempted to access index 62 within vector of size 62 when loading a ~100MiB CSV file that had some type diversity and long string values, but also had a redundant header row inserted in the middle of the file (i.e. someone had concatenated the rows of two CSV files without removing the header row from the second file). We use header=true, names, null_padding, and ignore_errors (the latter didn't help in this case). I've only observed this error on the linux/amd64 platform, although I'm not sure that's necessarily relevant.
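The concatenation scenario described above can be sketched in plain Python. This is a hypothetical pre-processing step (the strip_repeated_headers name is mine, not a DuckDB API) that drops any mid-file line identical to the header before the data is handed to a CSV reader:

```python
def strip_repeated_headers(text: str) -> str:
    """Drop any line that repeats the first (header) line of a CSV.

    A minimal sketch for the concatenated-files scenario above; it
    assumes the duplicate header is byte-identical to the first line.
    """
    lines = text.splitlines()
    if not lines:
        return text
    header = lines[0]
    body = [line for line in lines[1:] if line != header]
    return "\n".join([header] + body) + "\n"

# Two CSV files naively concatenated, so the header appears mid-file.
concatenated = "id,title\n1,foo\nid,title\n2,bar\n"
print(strip_repeated_headers(concatenated))
```

Matching on the exact header line is deliberately conservative; a header that differs by whitespace or column order would need a smarter comparison.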

@tboddyspargo

tboddyspargo commented Mar 26, 2024

UPDATE: I have a reproducer for python! Hopefully this helps to identify a root cause!

This reproducer demonstrates one flavor of this issue on both v0.9.2 and v0.10.1 as well as the latest pre-release version (v0.10.2.dev213).

import duckdb

conn = duckdb.connect()
rel: duckdb.DuckDBPyRelation = conn.read_csv(
    "/tmp/youtube_videos_sm.csv",
    header=True,
    null_padding=True,
)
rel.show()

youtube_videos_sm.csv

Result:

duckdb.duckdb.InternalException: INTERNAL Error: Attempted to access index 63 within vector of size 63

Shrinking the file further (by removing rows) seems to cause a different CSV dialect to be sniffed (i.e. a different delimiter, quote character, or escape character), and the InternalException no longer occurs. However, at that point I can't seem to correct the dialect options manually... Providing the ones I want just seems to provoke: duckdb.duckdb.InvalidInputException: Invalid Input Error: Error in file "[...]": CSV options could not be auto-detected. Consider setting parser options manually..
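As a loose stdlib analogy (Python's csv.Sniffer, not DuckDB's sniffer), the dialect a sniffer infers depends on the sample it is shown, which is consistent with truncating the file flipping the detected delimiter. The samples below are illustrative, not taken from the attached file:

```python
import csv

# Two small samples that differ only in separator; the sniffed dialect
# follows whatever the sample happens to contain.
comma_sample = "a,b,c\n1,2,3\n4,5,6\n"
pipe_sample = "a|b|c\n1|2|3\n4|5|6\n"

comma_dialect = csv.Sniffer().sniff(comma_sample)
pipe_dialect = csv.Sniffer().sniff(pipe_sample)
print(comma_dialect.delimiter)  # ','
print(pipe_dialect.delimiter)   # '|'
```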

@Tmonster
Contributor

Tmonster commented Apr 9, 2024


@tboddyspargo I've managed to fix the issue the author had, but could not manage to reproduce your issue. I also looked at your CSV file and there seem to be two types of separators, both "," and "|". I think that is why you get the InvalidInputException. Can you try specifying a delimiter in your options? If the problem persists, can you also try with a fresh install of duckdb? That would make it more likely that I can reproduce the issue.

@tboddyspargo

I also looked at your csv file and there seem to be 2 types of separators, both "," and "|".

You're right! I didn't notice it before, but there's inconsistency with quoting and escaping quotes in the tags column. That would explain why the sniffer was getting confused depending on how I truncated the file. It seems to have been sniffing \0 quote and escape characters whenever it sampled among those inconsistent rows. With null_padding=true, that resulted in a bunch of extra columns from the , characters in the rows with long description values. With null_padding=false, it usually just truncated any of those extra columns that didn't align with the header row.
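The column-count mismatch described above can be illustrated with Python's stdlib csv module (an illustrative sample, not the actual file): quoting the tags value keeps its embedded comma inside one field, while leaving an equivalent value unquoted splits it into an extra column.

```python
import csv
import io

# Row 1 quotes the tags field; row 2 leaves a similar value unquoted,
# so its embedded comma produces an extra column.
sample = 'id,tags,views\n1,"music,live",100\n2,music,live,200\n'
rows = list(csv.reader(io.StringIO(sample)))
for row in rows:
    print(len(row), row)
# The header and row 1 parse to 3 fields, row 2 to 4 -- the kind of
# ragged width that null_padding=true pads out with extra NULL columns.
```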

@tboddyspargo I've managed to fix the issue the author had, but could not manage to reproduce your issue.

It seems I may have too aggressively truncated the file without re-testing on other versions. Here are the results of my repro with youtube_videos_sm.csv:

>>> print(duckdb.__version__)
0.9.2
>>> conn = duckdb.connect()
>>> rel: duckdb.DuckDBPyRelation = conn.read_csv(
...     "/Users/tyler/Downloads/youtube_videos_sm_broken.csv",
...     header=True,
...     null_padding=True,
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
duckdb.duckdb.InternalException: INTERNAL Error: Attempted to access index 63 within vector of size 63
>>> print(duckdb.__version__)
0.10.1
>>> conn = duckdb.connect()
>>> rel: duckdb.DuckDBPyRelation = conn.read_csv(
...     "/Users/tyler/Downloads/youtube_videos_sm_broken.csv",
...     header=True,
...     null_padding=True,
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
duckdb.duckdb.InvalidInputException: Invalid Input Error: Error when sniffing file "/Users/tyler/Downloads/youtube_videos_sm_broken.csv".
CSV options could not be auto-detected. Consider setting parser options manually.
>>> print(duckdb.__version__)
0.10.2-dev311
>>> conn = duckdb.connect()
>>> rel: duckdb.DuckDBPyRelation = conn.read_csv(
...     "/Users/tyler/Downloads/youtube_videos_sm_broken.csv",
...     header=True,
...     null_padding=True,
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
duckdb.duckdb.InvalidInputException: Invalid Input Error: Error when sniffing file "/Users/tyler/Downloads/youtube_videos_sm_broken.csv".
CSV options could not be auto-detected. Consider setting parser options manually.

With the issue (presumably in the sniffer) addressed after 0.9.2, I think the Invalid Input Error is appropriate given the formatting issues with the file. I don't think I have any remaining reproducible behavior that I'm concerned about. If I'm able to identify concerning scenarios in the future, I'll raise separate issues. Thank you for checking back in with me!

@mdp1125

mdp1125 commented Apr 9, 2024

hey @tboddyspargo I just encountered a similar case (original error: "Attempted to access index 1 within vector of size 1"), possibly due to long strings and formatting. I seem to be blocked on this; do you know what I should do to the file to resolve it? Any help would be really appreciated, thank you

@tboddyspargo

Based on this issue, you may just want to try with the latest pre-release version to start with. If that doesn't address the issue, then sharing a reproducer that fails on the latest pre-release would help the maintainers to effectively triage and investigate.
