Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Format] Simplify Execute and Query interface #61

Closed
Tracked by #76
zeroshade opened this issue Aug 10, 2022 · 18 comments · Fixed by #69
Closed
Tracked by #76

[Format] Simplify Execute and Query interface #61

zeroshade opened this issue Aug 10, 2022 · 18 comments · Fixed by #69

Comments

@zeroshade
Copy link
Member

Rather than the separate Execute / GetStream functions, it might be better to follow something similar to FlightSQL's interface or Go's database/sql API.

Have two functions:

  • Execute a query without expecting a result set: Execute(struct AdbcConnection*, const char*, struct AdbcResult*, struct AdbcError*) where AdbcResult would contain an optional LastInsertedID and Number of Rows affected
  • Execute a query to retrieve a result set: Query(struct AdbcConnection*, const char*, struct ArrowArrayStream*, struct AdbcError*) where the ArrowArrayStream is populated with the result set.

Corresponding methods would exist for a Statement just without the need for taking the const char* as it would already be prepared in the statement.

Some benefits of this idea:

  • With the interface enforcing the idea of the query -> retrieving the results being a single API call it makes concurrency easier to handle (any complexity would be handled by the driver implementation) by consumers of the interface.
  • Drivers can be aware of whether or not a user is expecting a result set and can operate accordingly. They don't have to interrogate the backend to know whether or not a result set is available or if the query was an update/insert vs a select.
@lidavidm lidavidm changed the title Simplify Execute and Query interface [Format] Simplify Execute and Query interface Aug 10, 2022
@lidavidm
Copy link
Member

I think it may still have sense to have a generic Execute to ease compatibility with APIs that do not differentiate between the types of queries (and note JDBC has all three!), but having an execute-with-rows-affected and execute-with-result-set is reasonable.

Concurrency is a separate discussion; I'm torn on whether we should push the complexity into the driver (and declare up front that, for example, everything must be thread-safe), or declare that everything is not thread-safe and leave clients to deal with it (so Go would have to lock the statement to use it), or (like DBAPI in Python) provide a way for drivers to indicate what they support (this is probably the worst of both worlds though).

@lidavidm
Copy link
Member

Also, possibly the driver manager could define execute-with-result-set and execute-with-rows-affected in terms of the generic execute + generic getters to retrieve the affected rows/last row ID. So a basic driver (e.g. SQLite, which doesn't differentiate) would be straightforward to implement still but clients can be mostly none the wiser. It does increase the actual API surface to have all those permutations though.

@lidavidm
Copy link
Member

CC @pitrou, @hannes, @krlmlr if you have opinions here?

@lwhite1 had the same feedback about executeQuery/execute in Java last month. So for consistency a query method returning a result set makes sense at the very least.

@lidavidm
Copy link
Member

Another reason to differentiate between queries with/without result sets: in a Postgres driver, that means we know when we can attempt to use COPY (...) TO STDOUT (FORMAT binary) (akin to DuckDB's integration with Postgres) to get bulk binary data instead of parsing data one row at a time.

@krlmlr
Copy link
Collaborator

krlmlr commented Aug 14, 2022

Just to be sure we're on the same page:

  • a "query" is a single SQL string that can return a result set but doesn't have to
  • a "statement" is the result of preparing an SQL string with placeholders, so that parameters can be added
  • we're debating whether the caller should declare up front if a result is expected

In R DBI we're using the term "query" to indicate something that returns a result set (with or without parameters), and a "statement" doesn't return a result set but we can query the number of rows affected.

The more information we can share with the driver up front, the better. Do we really need two methods, though -- would a single Query(struct AdbcConnection*, const char*, struct ArrowArrayStream*, struct AdbcError*) with an optional third argument (can be NULL) also work? Or perhaps an option argument that indicates if we expect a result set, and return metadata (rows affected etc.) via the arrow stream?

What should happen to queries that indicate that no result set is expected but where a result set is available? Is a warning useful or annoying?

@lidavidm
Copy link
Member

Just to be sure we're on the same page:

* a "query" is a single SQL string that can return a result set but doesn't have to

* a "statement" is the result of preparing an SQL string with placeholders, so that parameters can be added

* we're debating whether the caller should declare up front if a result is expected

In R DBI we're using the term "query" to indicate something that returns a result set (with or without parameters), and a "statement" doesn't return a result set but we can query the number of rows affected.

That sounds reasonable. (adbc.h should define some terminology up front!) Though I'm still leaning towards having a separate Statement struct even for things that aren't necessarily prepared statements, just because that makes it easier to add options (as suggested) without potentially breaking ABI.

The more information we can share with the driver up front, the better. Do we really need two methods, though -- would a single Query(struct AdbcConnection*, const char*, struct ArrowArrayStream*, struct AdbcError*) with an optional third argument (can be NULL) also work? Or perhaps an option argument that indicates if we expect a result set, and return metadata (rows affected etc.) via the arrow stream?

A single method sounds reasonable. The Statement struct can keep the fine-grained APIs. I'd possibly argue that if you want more metadata, you should resort to the lower level APIs. Maybe we can just add the row count parameter as well, for convenience?

Query(struct AdbcConnection*, const char*, struct ArrowArrayStream*, size_t*, struct AdbcError*)

If the query returns a result set, the ArrowArrayStream contains the result set and the size_t is the number of rows (if known, else 0). If not, the ArrowArrayStream is not set (or: contains last inserted IDs, if supported and configured) and the size_t contains rows affected.

What should happen to queries that indicate that no result set is expected but where a result set is available? Is a warning useful or annoying?

I think we can just ignore the result set in that case. This also makes it easy to support "Retrieve the last inserted id for inserts into an auto-increment table" that @zeroshade mentioned in #55: the caller can set an option to return the inserted IDs, and they'll come back via the result set.

@krlmlr
Copy link
Collaborator

krlmlr commented Aug 14, 2022

Last inserted IDs (or the results of computed columns in general, for that matter) can be obtained with the RETURNING syntax for most databases, SQL Server has OUTPUT . This seems far more robust to me than a "last inserted ID".

R DBI has dbGetQuery() and dbExecute() as methods that are sufficient for most users, and dbSendQuery() + dbBind() + dbFetch() + dbClearResult() or dbSendStatement() + dbBind() + dbClearResult() . If ADBC is to remain header-only, what would a convenience method look like?

@lidavidm
Copy link
Member

Right, and on the other hand, databases like SQLite have no reliable way to get the info. But APIs like JDBC, Python DBAPI, and Go's database API expose standard ways to get last inserted ID(s). I wasn't originally concerned with it since it feels like the 'wrong' use case to worry about so I'm fine ignoring it.

I think what's being discussed here is basically to have the same things as DBI, and the Query you propose seems sufficient? I suppose a header-only library cannot provide an (overridable) convenience function, so Query would have to be a full-fledged API that can be stubbed by the driver manager.

@lidavidm
Copy link
Member

Ok, so how does this sound (here, I'm ignoring #59 provide a "just query" method):

Remove AdbcStatementGetStream
Change AdbcStatementExecute to AdbcStatementExecute(struct AdbcStatement*, struct ArrowArrayStream*, size_t*, struct AdbcError*) where the ArrowArrayStream is for the result set and the size_t is for the rows affected (if known, else 0)
Change the various AdbcConnectionGetInfo methods to take struct ArrowArrayStream* as their output instead of struct AdbcStatement* (but clarify they still may count as a 'statement' for the purposes of concurrency - that is, they may acquire the same locks, etc. internally)

@zeroshade
Copy link
Member Author

That seems reasonable to me @lidavidm

@lidavidm
Copy link
Member

Ok, while digging I realized we also need to account for partitioning (=Flight/Flight SQL's ability to return multiple endpoints in a GetFlightInfo). So what I'm tentatively about to refactor towards is this:

/// \brief Execute a statement and get the results.
///
/// This invalidates any prior result sets.
///
/// \param[in] statement The statement to execute.
/// \param[in] out_type The expected result type:
///   - ADBC_OUTPUT_TYPE_NONE if the query should not generate a
///     result set;
///   - ADBC_OUTPUT_TYPE_ARROW for an ArrowArrayStream;
///   - ADBC_OUTPUT_TYPE_PARTITIONS for a count of partitions (see \ref
///     adbc-statement-partition below).
///   The result set will be in out.
/// \param[out] out The results. Must be NULL for output type NONE, a
///   pointer to an ArrowArrayStream for ARROW_ARRAY_STREAM, or a
///   pointer to a size_t for PARTITIONS.
/// \param[out] rows_affected The number of rows affected if known,
///   else -1. Pass NULL if the client does not want this information.
/// \param[out] error An optional location to return an error
///   message if necessary.
ADBC_EXPORT
AdbcStatusCode AdbcStatementExecute(struct AdbcStatement* statement, int output_type,
                                    void* out, int64_t* rows_affected,
                                    struct AdbcError* error);

/// \brief No results are expected from AdbcStatementExecute.  Pass
///   NULL to out.
#define ADBC_OUTPUT_TYPE_NONE 0
/// \brief Arrow data is expected from AdbcStatementExecute.  Pass
///   ArrowArrayStream* to out.
#define ADBC_OUTPUT_TYPE_ARROW 1
/// \brief Partitions are expected from AdbcStatementExecute.  Pass
///   size_t* to out to get the number of partitions, and use
///   AdbcStatementGetPartitionDesc to get a partition.
///
/// Drivers are not required to support partitioning.  In that case,
/// AdbcStatementExecute will return ADBC_STATUS_NOT_IMPLEMENTED.
#define ADBC_OUTPUT_TYPE_PARTITIONS 2

If we decide to add native support for "affected row IDs", that could also go here. Otherwise I don't expect that we'd need more enum variants here.

@lidavidm
Copy link
Member

Ok, I think we'll want to tweak the partitioning interface further after this refactor, see #68.

Also I suppose this supersedes the 'affected row count' issue in #55.

lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Aug 17, 2022
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Aug 18, 2022
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Aug 18, 2022
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Aug 19, 2022
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Aug 19, 2022
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Aug 22, 2022
lidavidm added a commit to lidavidm/arrow-adbc that referenced this issue Aug 24, 2022
@lidavidm
Copy link
Member

@zeroshade just to follow up, does the linked PR look reasonable (at least interface-wise)?

@krlmlr
Copy link
Collaborator

krlmlr commented Aug 25, 2022

For the record, SQLite now also has RETURNING: https://www.sqlite.org/lang_returning.html . I'm in favor of descoping "last inserted ID" for now.

How is the caller supposed to know the size of the void* out buffer? Should it be void** out instead?

I suspect the type of out depends on output_type? Would it be worth having separate methods per output type?

@lidavidm
Copy link
Member

out depends on output_type, yes, so it's either ArrowArrayStream*, NULL, or some structure that I haven't fleshed out yet in the case of partitioned results. I'm mostly indifferent to whether it's a single function or multiple functions.

@lidavidm lidavidm mentioned this issue Aug 25, 2022
12 tasks
@paleolimbot
Copy link
Member

It seems like this would be cleaner as multiple functions (as Kirill noted) by output type? Maybe:

ADBC_EXPORT AdbcStatusCode AdbcStatementExecuteVoid(struct AdbcStatement* statement, struct AdbcError* error);
ADBC_EXPORT AdbcStatusCode AdbcStatementExecute(struct AdbcStatement* statement, struct ArrowArrayStream* stream, struct AdbcError* error);
ADBC_EXPORT AdbcStatusCode AdbcStatementExecuteParitioned(struct AdbcStatement* statement, void* something, struct AdbcError* error);

@lidavidm
Copy link
Member

Alright, I'll split into three functions then.

@lidavidm
Copy link
Member

Updated the PR.

lidavidm added a commit that referenced this issue Aug 26, 2022
* [Format][C][Java][Python] Simplify execute/query interface

Fixes #61.

* Update vendored nanoarrow

* [C] Split Execute
@lidavidm lidavidm added this to the 0.1.0 milestone Dec 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging a pull request may close this issue.

4 participants