[Java] IO Error: Connection error for HTTP HEAD #10047

Open

NewtonVan opened this issue Dec 21, 2023 · 16 comments

@NewtonVan

NewtonVan commented Dec 21, 2023

What happens?

COPY (
    SELECT * FROM read_json_auto(${s3_src_file}, format='newline_delimited')
) TO ${s3_dst_file} (FORMAT PARQUET);

This runs fine in the duckdb CLI and in a demo program (which uses the same API as the program that has the issue). In our own program, we use the AWS Java API to upload the file, and DuckDB is used to perform the row-to-column conversion. It reports the following exception:

Caused by: java.sql.SQLException: Invalid Error: IO Error: Connection error for HTTP HEAD to '<src>.json'
	at org.duckdb.DuckDBNative.duckdb_jdbc_prepare(Native Method)
	at org.duckdb.DuckDBPreparedStatement.prepare(DuckDBPreparedStatement.java:106)

To Reproduce

I'm sorry, it's quite hard to reproduce. The same command runs well in both the duckdb CLI and our unit tests. The only similar issues I found are #9232 and #9647. A reasonable guess is that our S3 client occupies some connection resources and DuckDB reaches its limit.

OS:

macOS, Linux

DuckDB Version:

0.9.2

DuckDB Client:

java

Full Name:

yanhui chen

Affiliation:

ApeCloud

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have
@NewtonVan
Author

Are there any options for me to control DuckDB's connection configuration?

@samansmink
Contributor

samansmink commented Jan 3, 2024

@NewtonVan There's the keep-alive option, which might work as a workaround (#9648)

To use it, run SET http_keep_alive=false in duckdb

edit: I just realized this feature was not included in 0.9.2, so it will only be available in the dev builds of duckdb currently
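
For the Java client specifically, a minimal sketch of applying this setting through the DuckDB JDBC driver (assuming a dev build where the option exists; the query to run afterwards is only indicated as a comment):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class KeepAliveWorkaround {
    public static void main(String[] args) throws Exception {
        // In-memory DuckDB database via the JDBC driver.
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            // Only available in dev builds newer than 0.9.2, per the note above.
            stmt.execute("SET http_keep_alive=false");
            // ... then prepare/run the COPY (...) TO '...' (FORMAT PARQUET) statement.
        }
    }
}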

@samansmink
Contributor

@NewtonVan quick question: does the ${s3_src_file} contain a glob pattern? If so, how many files are there and how many rows per json file are there?

@daviddwlee84

Facing the same issue when using Python

import duckdb

con = duckdb.connect()
con.install_extension("httpfs")
con.load_extension("httpfs")
con.execute("SELECT * FROM 'https://raw.githubusercontent.com/duckdb/duckdb-web/main/data/weather.csv';")

Failed after about 300 seconds of execution

IO Error: SSLConnection error for HTTP HEAD to 'https://raw.githubusercontent.com/duckdb/duckdb-web/main/data/weather.csv'

Python 3.10
duckdb                    0.8.1
duckdb-engine             0.9.2

@erezinman

erezinman commented Jan 18, 2024

@samansmink http_keep_alive isn't available in the latest dev version either (in python).

@erezinman

@samansmink It seems the problem is bigger than that: the connections are not closed even when the connection object is garbage-collected in Python. This error persists even if a new duckdb connection is created on every use.

@erezinman

Another observation: this error has to do with the number of threads. On a thread that had previously created a (single) connection, reading a single file (in my case using read_parquet) will not raise an exception, but reading multiple files will.
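
A minimal sketch of that pattern through the Java client (bucket and paths are hypothetical placeholders; S3/GCS credentials and endpoint are assumed to already be configured via the usual SET s3_* options):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SingleVsMultiFileRead {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            stmt.execute("INSTALL httpfs");
            stmt.execute("LOAD httpfs");
            // Reading a single remote file on this connection: reported to work.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT count(*) FROM read_parquet('s3://my-bucket/data/part-0.parquet')")) {
                rs.next();
            }
            // Reading many files via a glob on the same connection: where the HEAD error was observed.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT count(*) FROM read_parquet('s3://my-bucket/data/*.parquet')")) {
                rs.next();
            }
        }
    }
}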

@samansmink
Contributor

@samansmink http_keep_alive isn't available in the latest dev version either (in python).

could you try:

python3 -m pip install duckdb==0.9.3.dev2934
Python 3.11.6 (main, Oct  2 2023, 20:46:14) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
>>> import duckdb
>>> con = duckdb.connect()
>>> con.query('set http_keep_alive=false;')
>>>

Sometimes the extensions for pre-releases aren't uploaded yet, but this should work.

Could you provide us with a reproducible script? Right now I'm not really sure how to reproduce this. I tried to run the Python query you provided for an extended time, but I could not get it to fail.

@erezinman

@samansmink

I might try to reproduce it later. It happened inside a very complex system, within a thread pool's operation (~30 threads), after reading ~100 parquet files from a GCS backend using httpfs.

@cddr

cddr commented Jan 24, 2024

We see this when trying to read a parquet glob from S3 too, so I don't think it's just JSON.

@mustafahasankhan

Facing the same issue when doing a read_parquet glob on a GCP bucket.

@mustafahasankhan

Folks,
I was able to reproduce the issue when there are 1200+ parquet files in a hive-partitioned prefix, but not when there are 600 files.
Note:
I just copied the same file repeatedly, and even then it raises the same error. The file has 10K records and is 20MB in size.
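
A rough sketch of building a comparable reproduction dataset with the Java client, writing the same generated ~10K-row file into many hive-style partitions and then globbing over them (bucket/prefix are placeholders, S3 credentials are assumed to be set via SET s3_access_key_id etc., and the generated file will not match the 20MB size exactly):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HivePartitionRepro {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            stmt.execute("INSTALL httpfs");
            stmt.execute("LOAD httpfs");
            // Write the same generated 10K-row file into 1200 hive-style partitions.
            for (int i = 0; i < 1200; i++) {
                stmt.execute(
                    "COPY (SELECT range AS id, 'x' || range::VARCHAR AS payload FROM range(10000)) " +
                    "TO 's3://my-bucket/repro/part=" + i + "/data.parquet' (FORMAT PARQUET)");
            }
            // Globbing over all partitions is where the HTTP HEAD / connection error shows up.
            stmt.execute(
                "SELECT count(*) FROM read_parquet('s3://my-bucket/repro/*/*.parquet', hive_partitioning=true)");
        }
    }
}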

@samansmink
Contributor

samansmink commented Mar 1, 2024

I have done a little more investigating and I can definitely reproduce this, but only when I increase the thread count to pretty high numbers (~300). I guess this is expected behaviour though and not really a bug. Also, disabling http_keep_alive completely solves the issue, afaict.

If people have a reproduction of this with a relatively low number of files and threads, I'm very interested; ideally I would get:

  • a data set which can easily trigger this
  • the thread count required to trigger this

Otherwise some detailed info on the dataset is also helpful:

  • in case of parquet especially: row_group_size, tuples per file, total files, number of columns

oh and please test on v0.10.0 :)

@mustafahasankhan do you know at what thread count this occurs and whether set http_keep_alive=false;, which is now available, resolves the issue?
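
For anyone collecting the parquet details requested above, a hedged sketch using DuckDB's parquet_metadata() table function from the Java client (file path is a placeholder; exact output column names may differ slightly between versions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ParquetDetails {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            // One line per row group: row counts and column counts for a local copy of one file.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT DISTINCT row_group_id, row_group_num_rows, row_group_num_columns " +
                    "FROM parquet_metadata('data.parquet')")) {
                while (rs.next()) {
                    System.out.printf("row group %d: %d rows, %d columns%n",
                            rs.getLong(1), rs.getLong(2), rs.getLong(3));
                }
            }
        }
    }
}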

@mustafahasankhan

@samansmink I was able to reproduce it with a thread count of even 1, given 1200+ files.
Yes, set http_keep_alive=false; seems to resolve the issue with v0.10.0.

@samansmink
Contributor

@mustafahasankhan thanks, that's interesting, that definitely sounds like a bug.

Could you give me access to this dataset? If not, could you give me some more details with which I can reproduce this? You can ping me at sam@duckdblabs.com

@chlimaferreira

For those using MinIO in Docker, use port 9000:9000 instead of 9000:9001.

docker run -p 9000:9000 -p 9090:9090 --name minio_name -e "MINIO_ROOT_USER=minio" -e "MINIO_ROOT_PASSWORD=minio123" -v ${HOME}/minio/data:/data quay.io/minio/minio server /data --console-address ":9090"

In my case it worked.
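
For context, 9000 is MinIO's S3 API port (the other mapping is the web console), so DuckDB has to be pointed at it explicitly. A hedged sketch of the relevant httpfs settings from the Java client, matching the credentials in the docker command above (bucket name is a placeholder):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MinioEndpointConfig {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            stmt.execute("INSTALL httpfs");
            stmt.execute("LOAD httpfs");
            // Point httpfs at the MinIO S3 API port, not the console port.
            stmt.execute("SET s3_endpoint='localhost:9000'");
            stmt.execute("SET s3_url_style='path'");
            stmt.execute("SET s3_use_ssl=false");
            stmt.execute("SET s3_access_key_id='minio'");
            stmt.execute("SET s3_secret_access_key='minio123'");
            stmt.execute("SELECT count(*) FROM read_parquet('s3://my-bucket/*.parquet')");
        }
    }
}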
