[Format] Simplify Execute and Query interface #61

zeroshade · 2022-08-10T16:15:01Z

Rather than the separate Execute / GetStream functions, it might be better to follow something similar to FlightSQL's interface or Go's database/sql API.

Have two functions:

Execute a query without expecting a result set: Execute(struct AdbcConnection*, const char*, struct AdbcResult*, struct AdbcError*) where AdbcResult would contain an optional LastInsertedID and Number of Rows affected
Execute a query to retrieve a result set: Query(struct AdbcConnection*, const char*, struct ArrowArrayStream*, struct AdbcError*) where the ArrowArrayStream is populated with the result set.

Corresponding methods would exist for a Statement just without the need for taking the const char* as it would already be prepared in the statement.

Some benefits of this idea:

With the interface enforcing the idea of the query -> retrieving the results being a single API call it makes concurrency easier to handle (any complexity would be handled by the driver implementation) by consumers of the interface.
Drivers can be aware of whether or not a user is expecting a result set and can operate accordingly. They don't have to interrogate the backend to know whether or not a result set is available or if the query was an update/insert vs a select.

The text was updated successfully, but these errors were encountered:

lidavidm · 2022-08-10T16:52:47Z

I think it may still have sense to have a generic Execute to ease compatibility with APIs that do not differentiate between the types of queries (and note JDBC has all three!), but having an execute-with-rows-affected and execute-with-result-set is reasonable.

Concurrency is a separate discussion; I'm torn on whether we should push the complexity into the driver (and declare up front that, for example, everything must be thread-safe), or declare that everything is not thread-safe and leave clients to deal with it (so Go would have to lock the statement to use it), or (like DBAPI in Python) provide a way for drivers to indicate what they support (this is probably the worst of both worlds though).

lidavidm · 2022-08-10T16:54:29Z

Also, possibly the driver manager could define execute-with-result-set and execute-with-rows-affected in terms of the generic execute + generic getters to retrieve the affected rows/last row ID. So a basic driver (e.g. SQLite, which doesn't differentiate) would be straightforward to implement still but clients can be mostly none the wiser. It does increase the actual API surface to have all those permutations though.

lidavidm · 2022-08-10T16:58:43Z

CC @pitrou, @hannes, @krlmlr if you have opinions here?

@lwhite1 had the same feedback about executeQuery/execute in Java last month. So for consistency a query method returning a result set makes sense at the very least.

lidavidm · 2022-08-12T16:50:36Z

Another reason to differentiate between queries with/without result sets: in a Postgres driver, that means we know when we can attempt to use COPY (...) TO STDOUT (FORMAT binary) (akin to DuckDB's integration with Postgres) to get bulk binary data instead of parsing data one row at a time.

krlmlr · 2022-08-14T15:37:12Z

Just to be sure we're on the same page:

a "query" is a single SQL string that can return a result set but doesn't have to
a "statement" is the result of preparing an SQL string with placeholders, so that parameters can be added
we're debating whether the caller should declare up front if a result is expected

In R DBI we're using the term "query" to indicate something that returns a result set (with or without parameters), and a "statement" doesn't return a result set but we can query the number of rows affected.

The more information we can share with the driver up front, the better. Do we really need two methods, though -- would a single Query(struct AdbcConnection*, const char*, struct ArrowArrayStream*, struct AdbcError*) with an optional third argument (can be NULL) also work? Or perhaps an option argument that indicates if we expect a result set, and return metadata (rows affected etc.) via the arrow stream?

What should happen to queries that indicate that no result set is expected but where a result set is available? Is a warning useful or annoying?

lidavidm · 2022-08-14T17:33:18Z

Just to be sure we're on the same page:
* a "query" is a single SQL string that can return a result set but doesn't have to

* a "statement" is the result of preparing an SQL string with placeholders, so that parameters can be added

* we're debating whether the caller should declare up front if a result is expected
In R DBI we're using the term "query" to indicate something that returns a result set (with or without parameters), and a "statement" doesn't return a result set but we can query the number of rows affected.

That sounds reasonable. (adbc.h should define some terminology up front!) Though I'm still leaning towards having a separate Statement struct even for things that aren't necessarily prepared statements, just because that makes it easier to add options (as suggested) without potentially breaking ABI.

The more information we can share with the driver up front, the better. Do we really need two methods, though -- would a single Query(struct AdbcConnection*, const char*, struct ArrowArrayStream*, struct AdbcError*) with an optional third argument (can be NULL) also work? Or perhaps an option argument that indicates if we expect a result set, and return metadata (rows affected etc.) via the arrow stream?

A single method sounds reasonable. The Statement struct can keep the fine-grained APIs. I'd possibly argue that if you want more metadata, you should resort to the lower level APIs. Maybe we can just add the row count parameter as well, for convenience?

Query(struct AdbcConnection*, const char*, struct ArrowArrayStream*, size_t*, struct AdbcError*)

If the query returns a result set, the ArrowArrayStream contains the result set and the size_t is the number of rows (if known, else 0). If not, the ArrowArrayStream is not set (or: contains last inserted IDs, if supported and configured) and the size_t contains rows affected.

What should happen to queries that indicate that no result set is expected but where a result set is available? Is a warning useful or annoying?

I think we can just ignore the result set in that case. This also makes it easy to support "Retrieve the last inserted id for inserts into an auto-increment table" that @zeroshade mentioned in #55: the caller can set an option to return the inserted IDs, and they'll come back via the result set.

krlmlr · 2022-08-14T18:00:17Z

Last inserted IDs (or the results of computed columns in general, for that matter) can be obtained with the RETURNING syntax for most databases, SQL Server has OUTPUT . This seems far more robust to me than a "last inserted ID".

R DBI has dbGetQuery() and dbExecute() as methods that are sufficient for most users, and dbSendQuery() + dbBind() + dbFetch() + dbClearResult() or dbSendStatement() + dbBind() + dbClearResult() . If ADBC is to remain header-only, what would a convenience method look like?

lidavidm · 2022-08-14T18:07:19Z

Right, and on the other hand, databases like SQLite have no reliable way to get the info. But APIs like JDBC, Python DBAPI, and Go's database API expose standard ways to get last inserted ID(s). I wasn't originally concerned with it since it feels like the 'wrong' use case to worry about so I'm fine ignoring it.

I think what's being discussed here is basically to have the same things as DBI, and the Query you propose seems sufficient? I suppose a header-only library cannot provide an (overridable) convenience function, so Query would have to be a full-fledged API that can be stubbed by the driver manager.

lidavidm · 2022-08-15T19:15:40Z

Ok, so how does this sound (here, I'm ignoring #59 provide a "just query" method):

Remove AdbcStatementGetStream
Change AdbcStatementExecute to AdbcStatementExecute(struct AdbcStatement*, struct ArrowArrayStream*, size_t*, struct AdbcError*) where the ArrowArrayStream is for the result set and the size_t is for the rows affected (if known, else 0)
Change the various AdbcConnectionGetInfo methods to take struct ArrowArrayStream* as their output instead of struct AdbcStatement* (but clarify they still may count as a 'statement' for the purposes of concurrency - that is, they may acquire the same locks, etc. internally)

zeroshade · 2022-08-15T20:07:15Z

That seems reasonable to me @lidavidm

lidavidm · 2022-08-16T20:33:43Z

Ok, while digging I realized we also need to account for partitioning (=Flight/Flight SQL's ability to return multiple endpoints in a GetFlightInfo). So what I'm tentatively about to refactor towards is this:

/// \brief Execute a statement and get the results.
///
/// This invalidates any prior result sets.
///
/// \param[in] statement The statement to execute.
/// \param[in] out_type The expected result type:
///   - ADBC_OUTPUT_TYPE_NONE if the query should not generate a
///     result set;
///   - ADBC_OUTPUT_TYPE_ARROW for an ArrowArrayStream;
///   - ADBC_OUTPUT_TYPE_PARTITIONS for a count of partitions (see \ref
///     adbc-statement-partition below).
///   The result set will be in out.
/// \param[out] out The results. Must be NULL for output type NONE, a
///   pointer to an ArrowArrayStream for ARROW_ARRAY_STREAM, or a
///   pointer to a size_t for PARTITIONS.
/// \param[out] rows_affected The number of rows affected if known,
///   else -1. Pass NULL if the client does not want this information.
/// \param[out] error An optional location to return an error
///   message if necessary.
ADBC_EXPORT
AdbcStatusCode AdbcStatementExecute(struct AdbcStatement* statement, int output_type,
                                    void* out, int64_t* rows_affected,
                                    struct AdbcError* error);

/// \brief No results are expected from AdbcStatementExecute.  Pass
///   NULL to out.
#define ADBC_OUTPUT_TYPE_NONE 0
/// \brief Arrow data is expected from AdbcStatementExecute.  Pass
///   ArrowArrayStream* to out.
#define ADBC_OUTPUT_TYPE_ARROW 1
/// \brief Partitions are expected from AdbcStatementExecute.  Pass
///   size_t* to out to get the number of partitions, and use
///   AdbcStatementGetPartitionDesc to get a partition.
///
/// Drivers are not required to support partitioning.  In that case,
/// AdbcStatementExecute will return ADBC_STATUS_NOT_IMPLEMENTED.
#define ADBC_OUTPUT_TYPE_PARTITIONS 2

If we decide to add native support for "affected row IDs", that could also go here. Otherwise I don't expect that we'd need more enum variants here.

lidavidm · 2022-08-17T15:15:33Z

Ok, I think we'll want to tweak the partitioning interface further after this refactor, see #68.

Also I suppose this supersedes the 'affected row count' issue in #55.

Fixes apache#61.

lidavidm · 2022-08-24T21:25:24Z

@zeroshade just to follow up, does the linked PR look reasonable (at least interface-wise)?

krlmlr · 2022-08-25T04:38:22Z

For the record, SQLite now also has RETURNING: https://www.sqlite.org/lang_returning.html . I'm in favor of descoping "last inserted ID" for now.

How is the caller supposed to know the size of the void* out buffer? Should it be void** out instead?

I suspect the type of out depends on output_type? Would it be worth having separate methods per output type?

lidavidm · 2022-08-25T11:32:47Z

out depends on output_type, yes, so it's either ArrowArrayStream*, NULL, or some structure that I haven't fleshed out yet in the case of partitioned results. I'm mostly indifferent to whether it's a single function or multiple functions.

paleolimbot · 2022-08-25T12:07:24Z

It seems like this would be cleaner as multiple functions (as Kirill noted) by output type? Maybe:

ADBC_EXPORT AdbcStatusCode AdbcStatementExecuteVoid(struct AdbcStatement* statement, struct AdbcError* error);
ADBC_EXPORT AdbcStatusCode AdbcStatementExecute(struct AdbcStatement* statement, struct ArrowArrayStream* stream, struct AdbcError* error);
ADBC_EXPORT AdbcStatusCode AdbcStatementExecuteParitioned(struct AdbcStatement* statement, void* something, struct AdbcError* error);

lidavidm · 2022-08-25T12:51:58Z

Alright, I'll split into three functions then.

lidavidm · 2022-08-25T18:02:49Z

Updated the PR.

* [Format][C][Java][Python] Simplify execute/query interface Fixes #61. * Update vendored nanoarrow * [C] Split Execute

lidavidm changed the title ~~Simplify Execute and Query interface~~ [Format] Simplify Execute and Query interface Aug 10, 2022

lidavidm mentioned this issue Aug 12, 2022

[C] Basic libpq-based driver #65

Merged

lidavidm mentioned this issue Aug 17, 2022

[Format] Improve partitioned data interface #68

Closed

lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Aug 17, 2022

[Format][C][Java] Simplify Execute/Query interface

fb2f385

Fixes apache#61.

lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Aug 18, 2022

[Format][C][Java] Simplify Execute/Query interface

9676643

Fixes apache#61.

lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Aug 18, 2022

[Format][C][Java] Simplify Execute/Query interface

ec591d2

Fixes apache#61.

lidavidm mentioned this issue Aug 18, 2022

[Format][C][Java][Python] Simplify execute/query interface #69

Merged

lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Aug 19, 2022

[Format][C][Java] Simplify Execute/Query interface

ad03a0d

Fixes apache#61.

lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Aug 19, 2022

[Format][C][Java][Python] Simplify execute/query interface

3f18c92

Fixes apache#61.

lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Aug 22, 2022

[Format][C][Java][Python] Simplify execute/query interface

df201a2

Fixes apache#61.

lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Aug 24, 2022

[Format][C][Java][Python] Simplify execute/query interface

6b6d88f

Fixes apache#61.

lidavidm mentioned this issue Aug 25, 2022

"1.0" Tasks #76

Closed

12 tasks

lidavidm mentioned this issue Aug 25, 2022

[Python] Driver distribution #53

Closed

lidavidm closed this as completed in #69 Aug 26, 2022

lidavidm added a commit that referenced this issue Aug 26, 2022

[Format][C][Java][Python] Simplify execute/query interface (#69)

d8c5821

* [Format][C][Java][Python] Simplify execute/query interface Fixes #61. * Update vendored nanoarrow * [C] Split Execute

lidavidm added this to the 0.1.0 milestone Dec 1, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Format] Simplify Execute and Query interface #61

[Format] Simplify Execute and Query interface #61

zeroshade commented Aug 10, 2022

lidavidm commented Aug 10, 2022

lidavidm commented Aug 10, 2022

lidavidm commented Aug 10, 2022

lidavidm commented Aug 12, 2022

krlmlr commented Aug 14, 2022

lidavidm commented Aug 14, 2022

krlmlr commented Aug 14, 2022

lidavidm commented Aug 14, 2022

lidavidm commented Aug 15, 2022

zeroshade commented Aug 15, 2022

lidavidm commented Aug 16, 2022

lidavidm commented Aug 17, 2022

lidavidm commented Aug 24, 2022

krlmlr commented Aug 25, 2022

lidavidm commented Aug 25, 2022

paleolimbot commented Aug 25, 2022

lidavidm commented Aug 25, 2022

lidavidm commented Aug 25, 2022

[Format] Simplify Execute and Query interface #61

[Format] Simplify Execute and Query interface #61

Comments

zeroshade commented Aug 10, 2022

lidavidm commented Aug 10, 2022

lidavidm commented Aug 10, 2022

lidavidm commented Aug 10, 2022

lidavidm commented Aug 12, 2022

krlmlr commented Aug 14, 2022

lidavidm commented Aug 14, 2022

krlmlr commented Aug 14, 2022

lidavidm commented Aug 14, 2022

lidavidm commented Aug 15, 2022

zeroshade commented Aug 15, 2022

lidavidm commented Aug 16, 2022

lidavidm commented Aug 17, 2022

lidavidm commented Aug 24, 2022

krlmlr commented Aug 25, 2022

lidavidm commented Aug 25, 2022

paleolimbot commented Aug 25, 2022

lidavidm commented Aug 25, 2022

lidavidm commented Aug 25, 2022