
INTERNAL Error: Attempted to access index N within vector of size N #10950

Closed
1 task done
davidcorcoran opened this issue Mar 1, 2024 · 9 comments · Fixed by Tmonster/duckdb#149 or #11596


@davidcorcoran

davidcorcoran commented Mar 1, 2024

What happens?

Running the given query causes INTERNAL Error: Attempted to access index 3 within vector of size 3, after which the database produces: 'Error: FATAL Error: Failed: database has been invalidated because of a previous fatal error. The database must be restarted prior to being used again.'

To Reproduce

Run the following query.

SELECT t1.Transaction_ID
FROM transactions t1
WHERE t1.Transaction_ID IN
    (SELECT t2.Referred_Transaction_ID
     FROM transactions t2
     WHERE t2.Transaction_ID IN (123606, 123602, 131522, 123604, 131470)
         AND t2.Transaction_ID NOT IN (SELECT t2_filter.Transaction_ID FROM transactions t2_filter))

That query is a very stripped-down version of the query that caused the error for me. So although it looks odd and pointless, it still produces the INTERNAL Error.
Use the following database:
bug.db.zip

Just to note something odd: If you export as CSV from that database, then reimport into a new database, the query will succeed.

OS:

x64, aarch64

DuckDB Version:

0.9.2, 0.10.0, nightly on 2024-03-01

DuckDB Client:

Java, CLI

Full Name:

David Corcoran

Affiliation:

Topaz (https://topaz.technology/)
We use DuckDB as an in memory pivoting database. We migrated from originally using MonetDB for the same job.

Have you tried this on the latest nightly build?

I have tested with a nightly build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have
@carlopi
Contributor

carlopi commented Mar 1, 2024

Can you edit the repro to:

SELECT t1.Transaction_ID
FROM transactions t1
WHERE t1.Transaction_ID IN
      (SELECT t2.Referred_Transaction_ID
       FROM transactions t2
       WHERE t2.Transaction_ID IN (123606, 123602, 131522, 123604, 131470)
         AND t2.Transaction_ID NOT IN (SELECT t2_filter.Transaction_ID FROM transactions t2_filter))

Thanks for the report

@sirfz

sirfz commented Mar 4, 2024

I get this error randomly when querying CSV files (using read_csv_auto). Apologies for the lack of reproduction code/data, but is it possible this happens when running aggregation queries (e.g. select count()) with read_csv_auto? Right now I consistently hit the error by running this query:

conn.query("select count() from read_csv_auto('/path/to/pairs*_data.csv.gz', header=true, filename = true)").show()

The pattern matches two files.

Although I can load the files without issue:

dp = conn.query("select * from read_csv_auto('/path/to/pairs*_data.csv.gz', header=true, filename = true)").arrow()

@davidcorcoran
Author

We rewrote our query to use an IN instead of the correlated subquery and it's been working fine for a few weeks.

@tboddyspargo

I ran into INTERNAL Error: Attempted to access index 62 within vector of size 62 when loading a ~100MiB CSV file that had some type diversity and long string values, but also had a redundant header row inserted in the middle of the file (i.e. someone had concatenated the rows of two CSV files without removing the header row from the second file). We use header=true, names, null_padding, and ignore_errors (the latter didn't help in this case). I've only observed this error on the linux/amd64 platform, although I'm not sure that's necessarily relevant.
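The concatenation scenario described above can be sketched in plain Python. This is a hypothetical pre-processing step (the strip_repeated_headers name is mine, not a DuckDB API) that drops any mid-file line identical to the header before the data is handed to a CSV reader:

```python
def strip_repeated_headers(text: str) -> str:
    """Drop any line that repeats the first (header) line of a CSV.

    A minimal sketch for the concatenated-files scenario above; it
    assumes the duplicate header is byte-identical to the first line.
    """
    lines = text.splitlines()
    if not lines:
        return text
    header = lines[0]
    body = [line for line in lines[1:] if line != header]
    return "\n".join([header] + body) + "\n"

# Two CSV files naively concatenated, so the header appears mid-file.
concatenated = "id,title\n1,foo\nid,title\n2,bar\n"
print(strip_repeated_headers(concatenated))
```

Matching on the exact header line is deliberately conservative; a header that differs by whitespace or column order would need a smarter comparison.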

@tboddyspargo

tboddyspargo commented Mar 26, 2024

UPDATE: I have a reproducer for python! Hopefully this helps to identify a root cause!

This reproducer demonstrates one flavor of this issue on both v0.9.2 and v0.10.1 as well as the latest pre-release version (v0.10.2.dev213).

import duckdb

conn = duckdb.connect()
rel: duckdb.DuckDBPyRelation = conn.read_csv(
    "/tmp/youtube_videos_sm.csv",
    header=True,
    null_padding=True,
)
rel.show()

youtube_videos_sm.csv

Result:

duckdb.duckdb.InternalException: INTERNAL Error: Attempted to access index 63 within vector of size 63

Shrinking the file further (by removing rows) seems to cause a different CSV dialect to be sniffed (i.e. a different delimiter, quote character, or escape character), and the InternalException no longer occurs. However, at that point I can't seem to correct the dialect options manually... Providing the ones I want just seems to provoke: duckdb.duckdb.InvalidInputException: Invalid Input Error: Error in file "[...]": CSV options could not be auto-detected. Consider setting parser options manually..
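As a loose stdlib analogy (Python's csv.Sniffer, not DuckDB's sniffer), the dialect a sniffer infers depends on the sample it is shown, which is consistent with truncating the file flipping the detected delimiter. The samples below are illustrative, not taken from the attached file:

```python
import csv

# Two small samples that differ only in separator; the sniffed dialect
# follows whatever the sample happens to contain.
comma_sample = "a,b,c\n1,2,3\n4,5,6\n"
pipe_sample = "a|b|c\n1|2|3\n4|5|6\n"

comma_dialect = csv.Sniffer().sniff(comma_sample)
pipe_dialect = csv.Sniffer().sniff(pipe_sample)
print(comma_dialect.delimiter)  # ','
print(pipe_dialect.delimiter)   # '|'
```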

@Tmonster
Contributor

Tmonster commented Apr 9, 2024


@tboddyspargo I've managed to fix the issue the author had, but could not manage to reproduce your issue. I also looked at your CSV file and there seem to be two types of separators, both "," and "|". I think that is why you get the InvalidInputException. Can you try specifying a delimiter in your options? If the problem persists, can you also try with a fresh install of duckdb? That would make it more likely that I can reproduce the issue.

@tboddyspargo

I also looked at your csv file and there seem to be 2 types of separators, both "," and "|".

You're right! I didn't notice it before, but there's inconsistency with quoting and escaping quotes in the tags column. That would explain why the sniffer was getting confused depending on how I truncated the file. It seems to have been sniffing \0 quote and escape characters whenever it sampled among those inconsistent rows. With null_padding=true, that resulted in a bunch of extra columns from the , characters in the rows with long description values. With null_padding=false, it usually just truncated any of those extra columns that didn't align with the header row.
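The column-count mismatch described above can be illustrated with Python's stdlib csv module (an illustrative sample, not the actual file): quoting the tags value keeps its embedded comma inside one field, while leaving an equivalent value unquoted splits it into an extra column.

```python
import csv
import io

# Row 1 quotes the tags field; row 2 leaves a similar value unquoted,
# so its embedded comma produces an extra column.
sample = 'id,tags,views\n1,"music,live",100\n2,music,live,200\n'
rows = list(csv.reader(io.StringIO(sample)))
for row in rows:
    print(len(row), row)
# The header and row 1 parse to 3 fields, row 2 to 4 -- the kind of
# ragged width that null_padding=true pads out with extra NULL columns.
```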

@tboddyspargo I've managed to fix the issue the author had, but could not manage to reproduce your issue.

It seems I may have too aggressively truncated the file without re-testing on other versions. Here are the results of my repro with youtube_videos_sm.csv:

>>> print(duckdb.__version__)
0.9.2
>>> conn = duckdb.connect()
>>> rel: duckdb.DuckDBPyRelation = conn.read_csv(
...     "/Users/tyler/Downloads/youtube_videos_sm_broken.csv",
...     header=True,
...     null_padding=True,
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
duckdb.duckdb.InternalException: INTERNAL Error: Attempted to access index 63 within vector of size 63
>>> print(duckdb.__version__)
0.10.1
>>> conn = duckdb.connect()
>>> rel: duckdb.DuckDBPyRelation = conn.read_csv(
...     "/Users/tyler/Downloads/youtube_videos_sm_broken.csv",
...     header=True,
...     null_padding=True,
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
duckdb.duckdb.InvalidInputException: Invalid Input Error: Error when sniffing file "/Users/tyler/Downloads/youtube_videos_sm_broken.csv".
CSV options could not be auto-detected. Consider setting parser options manually.
>>> print(duckdb.__version__)
0.10.2-dev311
>>> conn = duckdb.connect()
>>> rel: duckdb.DuckDBPyRelation = conn.read_csv(
...     "/Users/tyler/Downloads/youtube_videos_sm_broken.csv",
...     header=True,
...     null_padding=True,
... )
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
duckdb.duckdb.InvalidInputException: Invalid Input Error: Error when sniffing file "/Users/tyler/Downloads/youtube_videos_sm_broken.csv".
CSV options could not be auto-detected. Consider setting parser options manually.

With the issue (presumably in the sniffer) addressed after 0.9.2, I think the Invalid Input Error is appropriate given the formatting issues with the file. I don't think I have any remaining reproducible behavior that I'm concerned about. If I'm able to identify concerning scenarios in the future, I'll raise separate issues. Thank you for checking back in with me!

@mdp1125

mdp1125 commented Apr 9, 2024

hey @tboddyspargo I just encountered a similar case (original error: "Attempted to access index 1 within vector of size 1"), possibly due to long strings and formatting. I seem to be blocked on this; do you know what I should do to the file to resolve it? Any help would be really appreciated, thank you

@tboddyspargo

Based on this issue, you may just want to try with the latest pre-release version to start with. If that doesn't address the issue, then sharing a reproducer that fails on the latest pre-release would help the maintainers to effectively triage and investigate.
