[Java] IO Error: Connection error for HTTP HEAD #10047

Open

NewtonVan opened this issue Dec 21, 2023 · 16 comments

@NewtonVan

NewtonVan commented Dec 21, 2023

What happens?

COPY (
    SELECT * FROM read_json_auto(${s3_src_file}, format='newline_delimited')
) TO ${s3_dst_file} (FORMAT PARQUET);

This runs fine in the duckdb CLI and in a demo program (which uses the same API as the program that has the issue). In our own program, we use the AWS Java API to upload the file, and DuckDB is used to perform the row-to-column conversion. It reports the following exception:

Caused by: java.sql.SQLException: Invalid Error: IO Error: Connection error for HTTP HEAD to '<src>.json'
	at org.duckdb.DuckDBNative.duckdb_jdbc_prepare(Native Method)
	at org.duckdb.DuckDBPreparedStatement.prepare(DuckDBPreparedStatement.java:106)

To Reproduce

I'm sorry, it's quite hard to reproduce. The same command runs well in both the duckdb CLI and our unit tests. The only similar issues I found are #9232 and #9647. A reasonable guess is that our S3 client occupies some connection resources and DuckDB reaches its limit.

OS:

macOS, Linux

DuckDB Version:

0.9.2

DuckDB Client:

java

Full Name:

yanhui chen

Affiliation:

ApeCloud

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have
@NewtonVan
Author

Are there any options for me to control DuckDB's connection configuration?

@samansmink
Contributor

samansmink commented Jan 3, 2024

@NewtonVan There's the keep-alive option, which might work as a workaround (#9648)

To use it, run SET http_keep_alive=false in duckdb

edit: I just realized this feature was not included in 0.9.2, so it will only be available in the dev builds of duckdb currently
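
For the Java client specifically, a minimal sketch of applying this setting through the DuckDB JDBC driver (assuming a dev build where the option exists; the query to run afterwards is only indicated as a comment):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class KeepAliveWorkaround {
    public static void main(String[] args) throws Exception {
        // In-memory DuckDB database via the JDBC driver.
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            // Only available in dev builds newer than 0.9.2, per the note above.
            stmt.execute("SET http_keep_alive=false");
            // ... then prepare/run the COPY (...) TO '...' (FORMAT PARQUET) statement.
        }
    }
}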

@samansmink
Contributor

@NewtonVan quick question: does the ${s3_src_file} contain a glob pattern? If so, how many files are there and how many rows per json file are there?

@daviddwlee84

Facing the same issue when using Python

import duckdb

con = duckdb.connect()
con.install_extension("httpfs")
con.load_extension("httpfs")
con.execute("SELECT * FROM 'https://raw.githubusercontent.com/duckdb/duckdb-web/main/data/weather.csv';")

Failed after about 300 seconds of execution

IO Error: SSLConnection error for HTTP HEAD to 'https://raw.githubusercontent.com/duckdb/duckdb-web/main/data/weather.csv'

Python 3.10
duckdb                    0.8.1
duckdb-engine             0.9.2

@erezinman

erezinman commented Jan 18, 2024

@samansmink http_keep_alive isn't available in the latest dev version either (in python).

@erezinman

@samansmink It seems the problem is bigger than that: the connections are not closed even when the connection object is garbage-collected in Python. This error persists even if a new duckdb connection is created on every use.

@erezinman

Another observation: this error has to do with the number of threads. On a thread that had previously created a (single) connection, reading a single file (in my case using read_parquet) will not raise an exception, but reading multiple files will.
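
A minimal sketch of that pattern through the Java client (bucket and paths are hypothetical placeholders; S3/GCS credentials and endpoint are assumed to already be configured via the usual SET s3_* options):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SingleVsMultiFileRead {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            stmt.execute("INSTALL httpfs");
            stmt.execute("LOAD httpfs");
            // Reading a single remote file on this connection: reported to work.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT count(*) FROM read_parquet('s3://my-bucket/data/part-0.parquet')")) {
                rs.next();
            }
            // Reading many files via a glob on the same connection: where the HEAD error was observed.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT count(*) FROM read_parquet('s3://my-bucket/data/*.parquet')")) {
                rs.next();
            }
        }
    }
}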

@samansmink
Contributor

@samansmink http_keep_alive isn't available in the latest dev version either (in python).

could you try:

python3 -m pip install duckdb==0.9.3.dev2934
Python 3.11.6 (main, Oct  2 2023, 20:46:14) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
>>> import duckdb
>>> con = duckdb.connect()
>>> con.query('set http_keep_alive=false;')
>>>

Sometimes the extensions for pre-releases aren't uploaded yet, but this should work.

Could you provide us with a reproducible script? Right now I'm not really sure how to reproduce this. I tried to run the Python query you provided for an extended time, but I could not get it to fail.

@erezinman

@samansmink

I might try to reproduce it later. It happened inside a very complex system, within a thread pool's operation (~30 threads), after reading ~100 parquet files from a GCS backend using httpfs.

@cddr

cddr commented Jan 24, 2024

We see this when trying to read a parquet glob from S3 too, so I don't think it's just JSON.

@mustafahasankhan

Facing the same issue when doing a read_parquet glob on a GCP bucket.

@mustafahasankhan

Folks,
I was able to reproduce the issue when there are 1200+ parquet files in a hive-partitioned prefix, but not when there are 600 files.
Note:
I just copied the same file repeatedly, and even then it raises the same error. The file has 10K records and is 20MB in size.
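
A rough sketch of building a comparable reproduction dataset with the Java client, writing the same generated ~10K-row file into many hive-style partitions and then globbing over them (bucket/prefix are placeholders, S3 credentials are assumed to be set via SET s3_access_key_id etc., and the generated file will not match the 20MB size exactly):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HivePartitionRepro {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            stmt.execute("INSTALL httpfs");
            stmt.execute("LOAD httpfs");
            // Write the same generated 10K-row file into 1200 hive-style partitions.
            for (int i = 0; i < 1200; i++) {
                stmt.execute(
                    "COPY (SELECT range AS id, 'x' || range::VARCHAR AS payload FROM range(10000)) " +
                    "TO 's3://my-bucket/repro/part=" + i + "/data.parquet' (FORMAT PARQUET)");
            }
            // Globbing over all partitions is where the HTTP HEAD / connection error shows up.
            stmt.execute(
                "SELECT count(*) FROM read_parquet('s3://my-bucket/repro/*/*.parquet', hive_partitioning=true)");
        }
    }
}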

@samansmink
Contributor

samansmink commented Mar 1, 2024

I have done a little more investigating and I can definitely reproduce this, but only when I increase the thread count to pretty high numbers (~300). I guess this is expected behaviour though and not really a bug. Also, disabling http_keep_alive completely solves the issue, afaict.

If people have a reproduction of this with a relatively low number of files and threads, I'm very interested; ideally I would get:

  • a data set which can easily trigger this
  • the thread count required to trigger this

Otherwise some detailed info on the dataset is also helpful:

  • in case of parquet especially: row_group_size, tuples per file, total files, number of columns

oh and please test on v0.10.0 :)

@mustafahasankhan do you know at what thread count this occurs and whether set http_keep_alive=false;, which is now available, resolves the issue?
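
For anyone collecting the parquet details requested above, a hedged sketch using DuckDB's parquet_metadata() table function from the Java client (file path is a placeholder; exact output column names may differ slightly between versions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ParquetDetails {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            // One line per row group: row counts and column counts for a local copy of one file.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT DISTINCT row_group_id, row_group_num_rows, row_group_num_columns " +
                    "FROM parquet_metadata('data.parquet')")) {
                while (rs.next()) {
                    System.out.printf("row group %d: %d rows, %d columns%n",
                            rs.getLong(1), rs.getLong(2), rs.getLong(3));
                }
            }
        }
    }
}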

@mustafahasankhan

@samansmink I was able to reproduce it with a thread count of even 1, given 1200+ files.
Yes, set http_keep_alive=false; seems to resolve the issue with v0.10.0.

@samansmink
Contributor

@mustafahasankhan thanks, that's interesting, that definitely sounds like a bug.

Could you give me access to this dataset? If not, could you give me some more details with which I can reproduce this? You can ping me at sam@duckdblabs.com

@chlimaferreira

For those using MinIO in Docker, use port 9000:9000 instead of 9000:9001.

docker run -p 9000:9000 -p 9090:9090 --name minio_name -e "MINIO_ROOT_USER=minio" -e "MINIO_ROOT_PASSWORD=minio123" -v ${HOME}/minio/data:/data quay.io/minio/minio server /data --console-address ":9090"

In my case it worked.
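
For context, 9000 is MinIO's S3 API port (the other mapping is the web console), so DuckDB has to be pointed at it explicitly. A hedged sketch of the relevant httpfs settings from the Java client, matching the credentials in the docker command above (bucket name is a placeholder):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MinioEndpointConfig {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            stmt.execute("INSTALL httpfs");
            stmt.execute("LOAD httpfs");
            // Point httpfs at the MinIO S3 API port, not the console port.
            stmt.execute("SET s3_endpoint='localhost:9000'");
            stmt.execute("SET s3_url_style='path'");
            stmt.execute("SET s3_use_ssl=false");
            stmt.execute("SET s3_access_key_id='minio'");
            stmt.execute("SET s3_secret_access_key='minio123'");
            stmt.execute("SELECT count(*) FROM read_parquet('s3://my-bucket/*.parquet')");
        }
    }
}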
