Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression: Http connect timeout no longer wired up #1384

Closed
Aklakan opened this issue Jun 14, 2022 · 8 comments
Closed

Regression: Http connect timeout no longer wired up #1384

Aklakan opened this issue Jun 14, 2022 · 8 comments
Labels

Comments

@Aklakan
Copy link
Contributor

Aklakan commented Jun 14, 2022

Version

4.6.0-SNAPSHOT

What happened?

When trying to re-activate one of my sparql benchmark processors (LSQ) I noticed that the following line no longer works with a remote HTTP connection:

try(QueryExecution qe = conn.query(query)) {
    qe.setTimeout(connectionTimeoutForRetrieval, executionTimeoutForRetrieval);
}

Initial/connect timeout may be a bit of a niche feature but when working with third-party endpoints it can be a very handy thing - I once had a case where a downed sparql endpoint would hold the connection for several minutes; that's why eventually I made use of this Jena feature.

In my case, the line above ends up at QueryExecHTTPBuilder which raises a UnsupportedOperationException:

class QueryExecHTTPBuilder {
    @Override
    public QueryExecHTTPBuilder initialTimeout(long timeout, TimeUnit timeUnit) {
        throw new UnsupportedOperationException();
    }
}

Attempting to using the more recent conn.newQuery().setTimeout() API also does not work, because it lacks lacks a method to set the initial (aka connect) timeout alltogether.

According to stack overflow the Java http client actually supports this feature so it seems it's just a matter of wiring this up again:

    // set the connection timeout value to 30 seconds (30000 milliseconds)
    final HttpParams httpParams = new BasicHttpParams();
    HttpConnectionParams.setConnectionTimeout(httpParams, 30000);
    client = new DefaultHttpClient(httpParams);

Which environment is running?

  • Linux

Relevant output and stacktrace

No response

Are you interested in making a pull request?

Yes

@Aklakan Aklakan added the bug label Jun 14, 2022
@afs
Copy link
Member

afs commented Jun 14, 2022

HttpConnectionParams is org.apache.http.params.HttpConnectionParams.

Jena now uses java.net.http.HttpClient, not Apache HttpClient.

Apache HttpClient is only in the dependencies because jsonld-java uses it. If Titanium becomes the only JSON-LD provider, then it won't be needed. We can probably drop the explicit mention now. jena used to need to control the version.

In java.net.http, connection timeout is "TCP connection setup" timeout. See java.net.http.HttpClient.Builder.connectTimeout

In the case where a new connection needs to be established, ...
If a new connection does not need to be established, for example if a connection can be reused
from a previous request, then this timeout duration has no effect.

In the case of RDFConnectionit does reuse the java.net.http.HttpClient and hence connections.
Unless provided, there is only one HttpClient (for all HTTP activity). So likely only the first use of the RDFConnection has any chance of a connection setup to a new remote and hence timeout.

HTTP/2 : there is no TCP connection setup other that the one shared host-host connection.

HTTP/3 : different again.

I assumed the lack of timeout controls for java.net.http.HttpClient is looking to the future of HTTP/2 and HTTP/3 where things are different.


QueryExecution.setTimeout is deprecated. There is some compatibility code (a hidden delayed building) because there is a "build-use" pattern nowadays -- QueryExec does not allow changes of setup after building.

@Aklakan
Copy link
Contributor Author

Aklakan commented Jun 14, 2022

Sigh, too many issues lately so I it happens I do things too hastily - I didn't notice that the stack overflow snippet referred to apache http client because the title referred to java http and I googled for java net http.

I see, the method now is java.net.http.HttpClient.Builder.connectTimeout (javadoc)

So I'll look into supplying my own HttpClient instance via RDFConnectionRemoteBuilder#httpClient then which would be good enough for my use case.

@afs
Copy link
Member

afs commented Jun 14, 2022

Yes - setting up the HttpClient is the right way to do it.

It is the only way as it get frozen at that point. It is possible to manually clone the setting of an HttpClient but it's afresh setup and not having the same connection caching (HTTP/1.1).

@Aklakan
Copy link
Contributor Author

Aklakan commented Jun 15, 2022

Hm again a twist: So with the connect timeout no longer present on the connection itself I have to set it on initialization.

However for this part I was using my own assembler-like setup which sets up a factory for RDFConnection objects from an RDF document. Back then I didn't realize that Jena's assembler system is generic so it can create any java object from RDF.

In my case, this allows for linking RDF benchmark results back to the connection setup and the involved thresholds.

Rather then extending my own system I'd rather integrate with Jena assemblers.
I think there is not yet an assembler for RDFConnection instances - is there?

So a rought draft would look like

#conn.ttl
<#httpClient> a ja:HttpClient ;
  ja:connectTimeout 10 . # seconds

<#conn> rdf:type ja:HttpRDFConnection;
     ja:destination <http://dbpedia.org/sparql> ;
     ja:httpClient <#httpClient> .

Followed by

Resource spec = ModelFactory.load("conn.ttl").getResource("#conn");
RDFConnection conn = RDFConnectionFactory.assemble(spec);

However, instead of returning only a single connection it may be better to create a factory for connections:

/** A factory/supplier of RDFConnection instances. Similar to a JDBC DataSource. */
public interface RDFDataSource
{
    RDFConnection getConnection();
}

RDFDataSourceFactory.assemble(...);

Would this be a reasonable approach? And would it make sense to add this functionality to Jena itself at some point?
(From my side it may be some time until I'd get around to do that because I'll try to resolve the other open issues first - so I am just asking for a perspective here)

@afs
Copy link
Member

afs commented Jun 15, 2022

Reasonable yes. I use assembler to make things other than datasets quite often.

https://github.com/Telicent-io/jena-fuseki-kafka/blob/main/ex-connector.ttl

is a connector between Kafka and Jena.

Rather then extending my own system I'd rather integrate with Jena assemblers.

There is no need to be in Jena, nor use ja: just as there no need for vocabularies to be in Jena.Any code can register them. AssemblerUtils. This also means you can develop, try out and change based on your usage before it gets "frozen" by putting it in Jena.

Is it something general other users would want? From what you describe (RDFDataSource), it is part of a bigger picture.

Aside: I've been toying with making more of assembling at the graph level.

https://github.com/Telicent-io/jena-fuseki-kafka/blob/main/jena-kafka-connector/src/main/java/org/apache/jena/kafka/KafkaConnectorAssembler.java#L112

@afs
Copy link
Member

afs commented Jun 15, 2022

If this issue is finished, could you close it please? It is very helpful to wrap things up - it helps track the state of the project.

We have the "discussions" area on GH for longer running and open ended discussions.

"Issues" are ideally actionable. There will more of them so longer discussions that involve general users may can get a little lost if mixed into a big "issues" pool.

@Aklakan
Copy link
Contributor Author

Aklakan commented Jun 16, 2022

Is it something general other users would want?

Well it's the analogy to JDBC, where DataSource is the factory for connection instances.
Based on DataSource, there are systems such as Hikari that implement connection pooling, automatic stale connection detection and replay of statements that failed due to connection loss.
The pooling aspect is particularly nice, because it enables components to frequently acquire short-lived connections (OLTP).
Closing those connections make them go back into the local pool without actually closing the physical connection to the database.

Also in the LSQ system, I use the RDF-variant of that interface for connection recovery: If it turns out that a connection was lost (e.g. due to a system-under-test actually crashing on a query) I can call getConnection() in a loop in an attempt to reconnect (e.g. while the docker container restarts). using whatever settings were in place when the DataSource was created.

In the bigger picture, Spring has the EmbeddedDatabase interface which extends DataSource with a shutdown() method (Nowadays I'd prefer AutoClosable.close()). This approach allows for abstracting e.g. a local Fuseki and a Virtuoso instances as RDFDataSources/RDFDatabases). The close() method could then e.g. shutdown the Virtuoso docker container.

So this pattern with an (RDF)DataSource is not my invention - but I noticed it's not part of the RDFConnection API.
Conversely, having the interfaces in Jena might make it easier to provide interoperable implementations outside of Jena because applications could rely on those interfaces.

If you want more evidence I could also eventually write to the Jena dev mailing list about it.

This also means you can develop, try out and change based on your usage before it gets "frozen" by putting it in Jena.

Yes, sure, I don't mean to 'spam' Jena with my custom components. But I think an assembler for the already Jena native class RDFConnection+HttpClient (maybe later also RDFDataSource) would be nice. But for now I'll go ahead as you suggest with making the changes in my own code.

I leave the issue open in case you want to leave a comment on the RDFDataSource, other than that you can close it.

@afs
Copy link
Member

afs commented Jun 16, 2022

I confused by whether you are arguing for RDFDataSource or the assembler.

If you are proposing RDFDataSource with all the sophistication you describe (caveat - inevitable overlaps with java.net.http.HttpClient), then great (at the RDFLink level).

An assembler on its own seems isolated and can wait until RDFDataSource.

I could also eventually write to the Jena dev mailing list about it.

That's a good idea.

And/or start a discussion.

@afs afs closed this as completed Jun 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants