Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce DbDataSource abstraction to System.Data #64812

Closed
roji opened this issue Feb 4, 2022 · 36 comments · Fixed by #70006
Closed

Introduce DbDataSource abstraction to System.Data #64812

roji opened this issue Feb 4, 2022 · 36 comments · Fixed by #70006
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Data
Milestone

Comments

@roji
Copy link
Member

roji commented Feb 4, 2022

Summary

This proposes introducing a new ADO.NET abstraction called DbDataSource, which represents a data source from which (1) connections can be retrieved, and against which (2) commands can be executed directly. tl;dr:

// Provider-specific data source construction:
var dataSourceBuilder = new NpgsqlDataSourceBuilder
{
    ConnectionString = "<CONNECTION_STRING>",
    // Set authentication callbacks, auth tokens, ILoggerFactory, and any other options
};
using var dataSource = dataSourceBuilder.Build();

// Use the data source as a connection factory:
await using (var connection = await dataSource.OpenConnectionAsync())
{
    // Use the connection as usual
}

// Provider-agnostic data source construction:
using var dataSource = dbProviderFactory.CreateDataSource("<CONNECTION_STRING>");

// Get a command that executes directly against the data source, without explicitly referring to the connection
await using (var command = dataSource.CreateCommand("SELECT * FROM Blogs"))
await using (var reader = command.ExecuteReaderAsync())
{
    // Process results as usual
}

Summary of benefits

  • DbDataSource encapsulates all information and configuration needed to provide an open database connection (auth information, logging setup, type mapping config, etc.), ready for executing commands.
  • This makes it suitable for passing around and registering in DI as a factory for connections, without needing to pass around any additional information.
  • It's intended for DbDataSources to correspond to connection pools managed internally in the database driver.
    • With the current API, drivers need to look up the internal pool each time a connection is opened, by the connection string. Since DbDataSource allows user code to directly reference an (abstracted) internal pool, it eliminates that lookup and helps perf.
    • DbDataSource isn't an abstraction for a connection pool (see #24856); that would require introducing some pooling-specific APIs, is complicated, and might not be a good idea. However, it could evolve into one in the future, with external pooling implementations which can be used across ADO.NET providers.
  • It's also possible to get and execute a DbCommand directly on DbDataSource, without needing to deal with DbConnection at all. In scenarios where no connection state is required (e.g. transactions), it's not necessary to burden users with managing the connection, so this can be done under the hood. In addition, this opens up command execution models which aren't tied to a specific connection (e.g. multiplexing on Npgsql).
  • The proposal includes a shim to make the new abstraction immediately available on all ADO.NET drivers, without them needing to implement anything (unless they wish to customize behavior).

Connection instantiation and dependency injection

In ADO.NET, DbConnection implementations are typically instantiated directly, passing the connection string to the constructor:

using var connection = new NpgsqlConnection("<CONNECTION_STRING>");

This requires knowing the connection string at every site where a connection is instantiated. In addition, further authentication details are frequently needed beyond the connection string (SSL callbacks, auth tokens...), which also need to be set on the connection before it's opened. In practice, this means that user applications typically need to create some sort of factory for creating connections, to centralize this information.

Using ADO.NET with dependency injection

For example, when using ADO.NET (or Dapper) with DI, it's possible to configure a scoped connection as follows:

builder.Services.AddScoped(_ => new NpgsqlConnection(
    builder.Configuration.GetConnectionString("BloggingContext")));

However, if multiple connections are needed for any reason, then this becomes more complicated, and some sort of factory is necessary. Rather than forcing every application to build this, DbDataSource would allow the following:

builder.Services.AddSingleton(sp => NpgsqlDataSource.Create(
    builder.Configuration.GetConnectionString("BloggingContext")));

Provider-specific sugar APIs could make this nicer:

builder.Services.AddNpgsqlDataSource(
    builder.Configuration.GetConnectionString("BloggingContext"), o => o
      .UseEncryption(...)
      .SomeOtherConfig(...));

Applications can then get injected with the DbDataSource, and open connections through it. Alternatively, they can configure a scoped connection from the DbDataSource to be injected directly.

API point for connection configuration

The fact that connections are currently directly instantiated makes it difficult to add any sort of configuration APIs; the main way to tweak ADO.NET behavior is currently via connection strings, but that's not appropriate in many cases.

For example, let's suppose an ADO.NET provider wishes to add support Microsoft.Extensions.Logging. It's not possible to pass ILoggerFactory via the connection string, and while it's possible to add a DbConnection constructor overload which accepts ILoggerFactory, that would mean re-instantiating all the required loggers each time a connection is created - which isn't acceptable for perf reasons. With DbDataSource, the logging factory could be assigned directly on the provider's DbDataSource implementation, and all connection retrieved from that DbDataSource would use it.

In a similar vein, Npgsql has an extensible plugin system for modifying type mappings. This allows users to read PostgreSQL date/time values as NodaTime types, instead of as built-in .NET types such as DateTime. Similarly, database spatial types can be read as types from the NetTopologySuite library. At this point, Npgsql plugins can be enabled either globally (affecting all connections in the program), or for a specific NpgsqlConnection instance (generally inappropriate for perf reasons). It would be more appropriate to enable such plugins at the DbDataSource level, allowing multiple type mapping schemes to exist in the same applications without any perf overhead, etc.

Other possible features enabled by DbDataSource could be centrally specifying SQL statements to be prepared on all physical connections, SQL to execute on every physical connection when it's first opened (e.g. to set some state variable), etc.

It's suggested that ADO.NET drivers expose builders - e.g. NpgsqlDataSourceBuilder - which can configure all provider-specific aspects of the data source. Once a data source has been built, it should generally be immutable with respect to configuration options since connections may already have been handed out. An exception to this would probably be the rotation of authentication tokens.

DbProviderFactory

ADO.NET already contains an abstraction called DbProviderFactory, which can be used to create connections and other ADO.NET objects. However, this abstraction only serves for writing database-agnostic code that isn't specific to a certain driver:

using var connection = dbProviderFactory.CreateConnection();

The resulting DbConnection has no connection string or any other information on it; DbProviderFactory is nothing more than an abstraction for instantiating the provider-specific DbConnection implementation. It is generally a singleton type which represents the driver, as opposed to a database, a connection pool or similar.

We could theoretically retrofit DbProviderFactory with the above new APIs instead of introducing a new DbDataSource type. However, mixing the two concerns could be confusing, aswe'd then have two types of DbProviderFactory instances:

  1. Those which were created with connection information/configuration (new kind), correspond to a database and can produce open, ready-to-use connections
  2. Those which weren't (existing kind), and correspond to the driver as a whole. They can only produce blank connections which require configuration before they can be opened

As a result, I'm proposing to keep DbProviderFactory and DbDataSource separate, and to add a new CreateDataSource method on the DbProviderFactory abstraction, to allow DbDataSource to be created and used in database-agnostic code.

Executing commands directly on DbDataSource

Beyond being just a factory for connections, we can allow executing commands directly on DbDataSource:

await using var command = dataSource.CreateCommand("SELECT ...");
command.Parameters.Add(...);
var result = await command.ExecuteScalarAsync();

The main advantage here is API usability: in many (most?) scenarios, there's no need for users to concern themselves with connections - they simply want to execute a query against a database. Any scenarios which involve connection state (e.g. transaction) do require explicitly handling connections, but in the common case where no connection state is involved, we can simplify things. In typical DI scenarios, it's expected that a DbDataSource get injected, and that most commands would be executed directly against it.

Note that this isn't a Dapper-like improvement of ADO.NET: users still deal with the standard DbCommand API, add parameters and read results as usual. In fact, Dapper could add extension methods for better usability over DbDataSource just as it does over DbConnection.

A second advantage is to open up execution models which bypass the concept of a connection - this is what Npgsql does with multiplexing. In this model, commands are provided to the driver for execution on any arbitrary connection; the driver is free to coalesce unrelated commands together for better efficiency, and to continue sending commands on connections even when they're already busy, allowing better utilization of pooled connections. Executing a command directly on a DbDataSource is the right API for this kind of execution model.

To make this available on all ADO.NET providers without requiring changes, the default implementation of DbDataSource.CreateCommand would return a DbCommand wrapper over the native DbCommand returned by the driver's DbProviderFactory.CreateCommand. DbCommand.ExecuteReader (and other execution methods) would be overridden to open the connection (retrieved from DbDataSource.GetConnection) before execution, and to close it afterwards (e.g. with CommandBehavior.CloseConnection). Since the connection is owned by the command in this scenario, accessing DbCommand.Connection and Transaction should raise an exception.

Additional notes

  • When instantiating a DbConnection directly and when pooling is enabled, the ADO.NET driver must locate the internal connection pool based on the connection string. This typically involves an internal dictionary lookup on the connection string, but if any additional information can affect pooling, the situation can become more complicated. DbDataSource eliminates this perf overhead by doing any setup once at startup, and from that point the user directly references the internal pool through the abstraction.
  • DbDataSource is disposable, so that disposing may close all connections and resources associated with them, similar to the semi-standard ClearPool API that exists in SqlClient, Npgsql, Microsoft.Data.SQLite, etc.
    • A disposed DbDataSource is expected to throw ObjectDisposedException from GetConnection. However, connections which have already been handed out should continue working, and can themselves be disposed.
  • Since the provider's regular DbConnection instances are handed out by DbDataSource, users may call Close and Open on them. While this is not the recommended usage pattern, this would work and have the same effect as if the connection were instantiated directly. That is, if Close is called on a DbConnection that was acquired from a DbDataSource that represents a connection pool, the connection would be returned to the pool as usual.
  • When users instantiate two pooling DbDataSource instances with the same connection strings, those DbDataSources would typically (but not necessarily) represent two distinct pools. This isn't the case when using the shim implementation, where all connections with the same connection string are managed via the same pool.
  • Naming-wise:
    • There's a DataSource property on DbConnection, which contains an arbitrary provider-specific string representing the database of the connection. Introducing DbDataSource seems reasonably consistent with that.
    • There's also DbDataSourceEnumerator which returns a DataTable with various information on visible databases. Logically it's the same concept of a DataSource, though, so that makes sense.
  • Unfortunately we can't use covariant return types on DbDataSource, since we have async APIs returning ValueTask<T> which isn't covariant. So we use the old ADO.NET "covariance" pattern (protected virtual plus public non-virtual that gets replaced in implementations).

API proposal

public abstract class DbDataSource : IDisposable, IAsyncDisposable
{
    public abstract string ConnectionString { get; }

    protected abstract DbConnection GetDbConnection();

    protected virtual DbConnection OpenDbConnection()
    {
        var connection = GetDbConnection();
        connection.Open();

        return connection;
    }

    protected virtual async ValueTask<DbConnection> OpenDbConnectionAsync(CancellationToken cancellationToken = default)
    {
        var connection = GetDbConnection();
        await connection.OpenAsync(cancellationToken);

        return connection;
    }

    protected virtual DbCommand CreateDbCommand(string? commandText = null)
    {
        var connection = GetDbConnection();
        var command = connection.CreateCommand();

        // Return a DbCommand wrapper over connection and command, which takes care of opening/closing the connection around executions.
    }

    protected virtual DbBatch CreateDbBatch()
    {
        // Similar to CreateDbCommand...
    }

    public DbConnection GetConnection() => GetDbConnection();
    public DbConnection OpenConnection() => OpenDbConnection();
    public ValueTask<DbConnection> OpenConnectionAsync() => OpenDbConnectionAsync();
    public DbCommand CreateCommand(string? commandText = null) => CreateDbCommand();
    public DbBatch CreateBatch() => CreateDbBatch();

    public virtual void Dispose()
    {
    }

    public virtual ValueTask DisposeAsync()
        => default;
}

For provider-agnostic code, we'd add the following on System.Data.Common.DbProviderFactory:

public class DbProviderFactory
{
    public virtual DbDataSource CreateDataSource(string connectionString)
        => new DefaultDataSource(this, connectionString);
}

To make this work on existing providers, the default implementation of CreateDataSource would return an internal "shim" implementation implemented which simply gets connections from the DbProviderFactory that created it:

internal sealed class DefaultDataSource : DbDataSource
{
    private readonly DbProviderFactory _dbProviderFactory;
    private readonly string _connectionString;

    internal DefaultDataSource(DbProviderFactory dbProviderFactory, string connectionString)
        => (_dbProviderFactory, _connectionString) = (dbProviderFactory, connectionString);

    public override string ConnectionString => _connectionString;

    protected override DbConnection GetDbConnection()
    {
        var connection = _dbProviderFactory.CreateConnection();
        if (connection is null)
        {
            throw new InvalidOperationException("DbProviderFactory returned a null connection");
        }

        connection.ConnectionString = _connectionString;

        return connection;
    }

    // TODO: Command execution
}

Edit history

Date Modification
2022-02-04 Initial proposal
2022-02-07 Add note on multiple DbDatasources, connection strings and pools. Clarified that DbDataSource.ConnectionString should be non-nullable,
2022-05-05 Made DbDataSource.Dispose (and async) virtual no-op instead of abstract. Added missing public non-virtual DbDataSource.CreateCommand() and CreateBatch(). Added optional commandText parameter to CreateCommand for usability.
@roji roji added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Data labels Feb 4, 2022
@roji roji added this to the 7.0.0 milestone Feb 4, 2022
@roji roji self-assigned this Feb 4, 2022
@ghost
Copy link

ghost commented Feb 4, 2022

Tagging subscribers to this area: @roji, @ajcvickers
See info in area-owners.md if you want to be subscribed.

Issue Details

Summary

This proposes introducing a new ADO.NET abstraction called DbDataSource, which represents a data source from which (1) connections can be retrieved, and against which (2) commands can be executed directly. tl;dr:

// Provider-specific data source construction:
var dataSourceBuilder = new NpgsqlDataSourceBuilder
{
    ConnectionString = "<CONNECTION_STRING>",
    // Set authentication callbacks, auth tokens, ILoggerFactory, and any other options
};
using var dataSource = dataSourceBuilder.Build();

// Use the data source as a connection factory:
await using (var connection = await dataSource.OpenConnectionAsync())
{
	// Use the connection as usual
}

// Provider-agnostic data source construction:
using var dataSource = dbProviderFactory.CreateDataSource("<CONNECTION_STRING>");

// Get a command that executes directly against the data source, without explicitly referring to the connection
await using (var command = dataSource.CreateCommand("SELECT * FROM Blogs"))
await using (var reader = command.ExecuteReaderAsync())
{
	// Process results as usual
}

Summary of benefits

  • DbDataSource encapsulates all information and configuration needed to provide an open database connection (auth information, logging setup, type mapping config, etc.), ready for executing commands.
  • This makes it suitable for passing around and registering in DI as a factory for connections, without needing to pass around any additional information.
  • It's intended for DbDataSources to correspond to connection pools managed internally in the database driver.
    • With the current API, drivers need to look up the internal pool each time a connection is opened, by the connection string. Since DbDataSource allows user code to directly reference an (abstracted) internal pool, it eliminates that lookup and helps perf.
    • DbDataSource isn't an abstraction for a connection pool (see #24856); that would require introducing some pooling-specific APIs, is complicated, and might not be a good idea. However, it could evolve into one in the future, with external pooling implementations which can be used across ADO.NET providers.
  • It's also possible to get and execute a DbCommand directly on DbDataSource, without needing to deal with DbConnection at all. In scenarios where no connection state is required (e.g. transactions), it's not necessary to burden users with managing the connection, so this can be done under the hood. In addition, this opens up command execution models which aren't tied to a specific connection (e.g. multiplexing on Npgsql).
  • The proposal includes a shim to make the new abstraction immediately available on all ADO.NET drivers, without them needing to implement anything (unless they wish to customize behavior).

Connection instantiation and dependency injection

In ADO.NET, DbConnection implementations are typically instantiated directly, passing the connection string to the constructor:

using var connection = new NpgsqlConnection("<CONNECTION_STRING>");

This requires knowing the connection string at every site where a connection is instantiated. In addition, further authentication details are frequently needed beyond the connection string (SSL callbacks, auth tokens...), which also need to be set on the connection before it's opened. In practice, this means that user applications typically need to create some sort of factory for creating connections, to centralize this information.

Using ADO.NET with dependency injection

For example, when using ADO.NET (or Dapper) with DI, it's possible to configure a scoped connection as follows:

builder.Services.AddScoped(_ => new NpgsqlConnection(
    builder.Configuration.GetConnectionString("BloggingContext")));

However, if multiple connections are needed for any reason, then this becomes more complicated, and some sort of factory is necessary. Rather than forcing every application to build this, DbDataSource would allow the following:

builder.Services.AddSingleton(sp => NpgsqlDataSource.Create(
    builder.Configuration.GetConnectionString("BloggingContext")));

Provider-specific sugar APIs could make this nicer:

builder.Services.AddNpgsqlDataSource(
    builder.Configuration.GetConnectionString("BloggingContext"), o => o
      .UseEncryption(...)
      .SomeOtherConfig(...));

Applications can then get injected with the DbDataSource, and open connections through it. Alternatively, they can configure a scoped connection from the DbDataSource to be injected directly.

API point for connection configuration

The fact that connections are currently directly instantiated makes it difficult to add any sort of configuration APIs; the main way to tweak ADO.NET behavior is currently via connection strings, but that's not appropriate in many cases.

For example, let's suppose an ADO.NET provider wishes to add support Microsoft.Extensions.Logging. It's not possible to pass ILoggerFactory via the connection string, and while it's possible to add a DbConnection constructor overload which accepts ILoggerFactory, that would mean re-instantiating all the required loggers each time a connection is created - which isn't acceptable for perf reasons. With DbDataSource, the logging factory could be assigned directly on the provider's DbDataSource implementation, and all connection retrieved from that DbDataSource would use it.

In a similar vein, Npgsql has an extensible plugin system for modifying type mappings. This allows users to read PostgreSQL date/time values as NodaTime types, instead of as built-in .NET types such as DateTime. Similarly, database spatial types can be read as types from the NetTopologySuite library. At this point, Npgsql plugins can be enabled either globally (affecting all connections in the program), or for a specific NpgsqlConnection instance (generally inappropriate for perf reasons). It would be more appropriate to enable such plugins at the DbDataSource level, allowing multiple type mapping schemes to exist in the same applications without any perf overhead, etc.

Other possible features enabled by DbDataSource could be centrally specifying SQL statements to be prepared on all physical connections, SQL to execute on every physical connection when it's first opened (e.g. to set some state variable), etc.

It's suggested that ADO.NET drivers expose builders - e.g. NpgsqlDataSourceBuilder - which can configure all provider-specific aspects of the data source. Once a data source has been built, it should generally be immutable with respect to configuration options since connections may already have been handed out. An exception to this would probably be the rotation of authentication tokens.

DbProviderFactory

ADO.NET already contains an abstraction called DbProviderFactory, which can be used to create connections and other ADO.NET objects. However, this abstraction only serves for writing database-agnostic code that isn't specific to a certain driver:

using var connection = dbProviderFactory.CreateConnection();

The resulting DbConnection has no connection string or any other information on it; DbProviderFactory is nothing more than an abstraction for instantiating the provider-specific DbConnection implementation. It is generally a singleton type which represents the driver, as opposed to a database, a connection pool or similar.

We could theoretically retrofit DbProviderFactory with the above new APIs instead of introducing a new DbDataSource type. However, mixing the two concerns could be confusing, aswe'd then have two types of DbProviderFactory instances:

  1. Those which were created with connection information/configuration (new kind), correspond to a database and can produce open, ready-to-use connections
  2. Those which weren't (existing kind), and correspond to the driver as a whole. They can only produce blank connections which require configuration before they can be opened

As a result, I'm proposing to keep DbProviderFactory and DbDataSource separate, and to add a new CreateDataSource method on the DbProviderFactory abstraction, to allow DbDataSource to be created and used in database-agnostic code.

Executing commands directly on DbDataSource

Beyond being just a factory for connections, we can allow executing commands directly on DbDataSource:

await using var command = dataSource.CreateCommand("SELECT ...");
command.Parameters.Add(...);
var result = await command.ExecuteScalarAsync();

The main advantage here is API usability: in many (most?) scenarios, there's no need for users to concern themselves with connections - they simply want to execute a query against a database. Any scenarios which involve connection state (e.g. transaction) do require explicitly handling connections, but in the common case where no connection state is involved, we can simplify things. In typical DI scenarios, it's expected that a DbDataSource get injected, and that most commands would be executed directly against it.

Note that this isn't a Dapper-like improvement of ADO.NET: users still deal with the standard DbCommand API, add parameters and read results as usual. In fact, Dapper could add extension methods for better usability over DbDataSource just as it does over DbConnection.

A second advantage is to open up execution models which bypass the concept of a connection - this is what Npgsql does with multiplexing. In this model, commands are provided to the driver for execution on any arbitrary connection; the driver is free to coalesce unrelated commands together for better efficiency, and to continue sending commands on connections even when they're already busy, allowing better utilization of pooled connections. Executing a command directly on a DbDataSource is the right API for this kind of execution model.

To make this available on all ADO.NET providers without requiring changes, the default implementation of DbDataSource.CreateCommand would return a DbCommand wrapper over the native DbCommand returned by the driver's DbProviderFactory.CreateCommand. DbCommand.ExecuteReader (and other execution methods) would be overridden to open the connection (retrieved from DbDataSource.GetConnection) before execution, and to close it afterwards (e.g. with CommandBehavior.CloseConnection). Since the connection is owned by the command in this scenario, accessing DbCommand.Connection should raise an exception.

Additional notes

  • When instantiating a DbConnection directly and when pooling is enabled, the ADO.NET driver must locate the internal connection pool based on the connection string. This typically involves an internal dictionary lookup on the connection string, but if any additional information can affect pooling, the situation can become more complicated. DbDataSource eliminates this perf overhead by doing any setup once at startup, and from that point the user directly references the internal pool through the abstraction.
  • DbDataSource is disposable, so that disposing may close all connections and resources associated with them, similar to the semi-standard ClearPool API that exists in SqlClient, Npgsql, Microsoft.Data.SQLite, etc.
    • A disposed DbDataSource is expected to throw ObjectDisposedException from GetConnection. However, connections which have already been handed out should continue working, and can themselves be disposed.
  • Since the provider's regular DbConnection instances are handed out by DbDataSource, users may call Close and Open on them. While this is not the recommended usage pattern, this would work and have the same effect as if the connection were instantiated directly. That is, if Close is called on a DbConnection that was acquired from a DbDataSource that represents a connection pool, the connection would be returned to the pool as usual.
  • Naming-wise, there's a DataSource property on DbConnection, which contains an arbitrary provider-specific string representing the database of the connection. Introducing DbDataSource seems reasonably consistent with that.
  • Unfortunately we can't use covariant return types on DbDataSource, since we have async APIs returning ValueTask<T> which isn't covariant. So we use the old ADO.NET "covariance" pattern (protected virtual plus public non-virtual that gets replaced in implementations).

API proposal

public abstract class DbDataSource : IDisposable, IAsyncDisposable
{
    public abstract string ConnectionString { get; } // TODO: Nullable?

    protected abstract DbConnection GetDbConnection();

    protected virtual DbConnection OpenDbConnection()
    {
        var connection = GetDbConnection();
        connection.Open();

        return connection;
    }

    protected virtual async ValueTask<DbConnection> OpenDbConnectionAsync(CancellationToken cancellationToken = default)
    {
        var connection = GetDbConnection();
        await connection.OpenAsync(cancellationToken);

        return connection;
    }

    protected virtual DbCommand CreateDbCommand()
    {
        var connection = GetDbConnection();
        var command = connection.CreateCommand();

        // Return a DbCommand wrapper over connection and command, which takes care of opening/closing the connection around executions.
    }

    // protected virtual DbCommand CreateDbBatch() ...

    public DbConnection GetConnection() => GetDbConnection();
    public DbConnection OpenConnection() => OpenDbConnection();
    public ValueTask<DbConnection> OpenConnectionAsync() => OpenDbConnectionAsync();

    public abstract void Dispose();

    public abstract ValueTask DisposeAsync();
}

For provider-agnostic code, we'd add the following on System.Data.Common.DbProviderFactory:

public class DbProviderFactory
{
    public virtual DbDataSource CreateDataSource(string? connectionString) => new DefaultDataSource(this, connectionString);
}

To make this work on existing providers, the default implementation of CreateDataSource would return an internal "shim" implementation implemented which simply gets connections from the DbProviderFactory that created it:

internal sealed class DefaultDataSource : DbDataSource
{
    private readonly DbProviderFactory _dbProviderFactory;
    private readonly string _connectionString;

    internal DefaultDataSource(DbProviderFactory dbProviderFactory, string connectionString)
        => (_dbProviderFactory, _connectionString) = (dbProviderFactory, connectionString);

    public override string ConnectionString => _connectionString;

    protected override DbConnection GetDbConnection()
    {
        var connection = _dbProviderFactory.CreateConnection();
        if (connection is null)
        {
            throw new InvalidOperationException("DbProviderFactory returned a null connection");
        }

        connection.ConnectionString = _connectionString;

        return connection;
    }

    // TODO: Command execution
}

Edit history

Date Modification
2022-02-04 Initial proposal
Author: roji
Assignees: roji
Labels:

api-suggestion, area-System.Data

Milestone: 7.0.0

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Feb 4, 2022
@roji
Copy link
Member Author

roji commented Feb 4, 2022

@roji roji removed the untriaged New issue has not been triaged by the area owner label Feb 4, 2022
@pinkfloydx33
Copy link

This would be great. I am getting tired of implementing factories for producing connections... though I often need to create a connection that varies based on something variable (known at runtime) other than the actual connection string. Almost something like a named data source that could be pre-registered and requested ala httpclient.

To explain, we have a single application, but multiple tenant databases. My factories often are built around named retrieval and the factory is able to build a connection using the correct connection string for the customer I'm working with. It would be great if this could somehow be supported natively, especially with DI.

@roji
Copy link
Member Author

roji commented Feb 5, 2022

@pinkfloydx33 thanks for the feedback. As a connection factory abstraction, a single DbDataSource isn't meant to handle multiple databases - that requires you to specify which database (or tenant) you want when you request a connection (and the connection objects themselves can't be pooled together if they're connected to different databases).

Of course, you can have a tenant manager service in front of multiple DbDataSource instances, each corresponding to a single connection string (and your tenant manager service would be responsible for instantiating a DbDataSource when a connection to a new tenant is requested). In other words, I'd expect multi-tenant multi-database applications to have a DbDataSource per tenant.

@bgrainger
Copy link
Contributor

I like it; I think a lot of existing code has to build their own concept of a factory for DI, even if it's as simple as .AddScoped(_ => new NpgsqlConnection(...)) example you gave.

// Provider-agnostic data source construction:
using var dataSource = dbProviderFactory.CreateDataSource("<CONNECTION_STRING>");

I'm not yet convinced that this is a good API. For symmetry with other DbProviderFactory methods, I would expect just CreateDataSource(). However, that implies that the DbDataSource.ConnectionString property would be mutable, which would be a bad idea; keeping DbDataSource immutable is good. Would it be better to standardise the DbDataSourceBuilder abstract base class, and have DbProviderFactory.CreateDataSourceBuilder()?

Secondly, the fact that a connection string is the only way to customise the DbDataSource (through this API) seems to imply that most/all useful properties can be set through the connection string, which this proposal already admits is a weakness of the current ADO.NET design.

Finally, it's hard to imagine a scenario where providerFactory.CreateDataSource(connectionString) provides any concrete benefit over var connection = providerFactory.CreateConnection(); connection.ConnectionString = connectionString;. (And the latter code would end up being written anyway, since it wouldn't require .NET 7.0.) Am I missing something?

public abstract string ConnectionString { get; } // TODO: Nullable?

This should be non-null. Even if a driver can be used with all defaulted values in the connection string, that could be expressed as “”. And 99.9% of the time, there will at least be a remote server hostname (non-SQLite) or local file name (SQLite).

It's intended for DbDataSources to correspond to connection pools managed internally in the database driver.

Connection pools are usually implicit by connection string (“This typically involves an internal dictionary lookup on the connection string”). If DbDataSource is 1-1 with a connection pool (which seems like a typical, simple implementation) it may be worth explicitly clarifying that there is no requirement that new ExampleConnection(connectionString) and new ExampleDataSource(connectionString) share the same physical pool. (Even two different DbDataSources with the exact same connection string might have separate pools.) The ADO.NET shim would imply an equivalence (because it has to use the underlying provider's logic, which will usually create a shared pool for equal connection strings), but a custom implementation in a provider might find it simpler to keep them separate. The documentation should call this out and warn users not to rely on connection pools being separate per DbDataSource or being shared, unless provider-specific documentation explicitly states otherwise. (Or that a connection pool even exists; some providers may just not support pooling at all.)

Naming-wise, there's a DataSource property on DbConnection, which contains an arbitrary provider-specific string representing the database of the connection. Introducing DbDataSource seems reasonably consistent with that.

This actually seems a little confusing to me; coming to the API with fresh eyes, I would expect the DataSource property to be a way to get back to the DbDataSource that created the connection, i.e., Assert.Equal(dataSource, dataSource.GetConnection().DataSource). (Obviously the existing property can't be changed.)

Should getting back to the DbDataSource from a connection be possible? One can imagine DB-independent code that, given a DbConnection, would like to create a second, “identical” connection. AFAIK, this is typically done by casting the DbConnection to ICloneable, for libraries that support that interface.

@Wraith2
Copy link
Contributor

Wraith2 commented Feb 5, 2022

keeping DbDataSource immutable is good. Would it be better to standardise the DbDataSourceBuilder abstract base class, and have DbProviderFactory.CreateDataSourceBuilder()

If DbDataSource is immutable it's really just a type that encapsulates the two properties on a connection string config setting, the value and the provider type, isn't it? I think I've mentioned before that I'm not a fan of passing connection strings around because they're missing the really important information about the provider. It also encourages people to manipulate them as strings when i think using the ConnectionStringBuilder class is a better idea. Is it worth considering getting a correctly types connection string builder from the DbDataSource?

It might be valuable to provide extensions to Microsoft.Extensions configuration readers so that DbDataSources can be created by users directly from config files rather than making them resolve the types themselves. So provide a DbDataSource dataSource = config.GetDbDataSource("name"); and then users can pass that around.

@bgrainger
Copy link
Contributor

If DbDataSource is immutable it's really just a type that encapsulates the two properties on a connection string config setting, the value and the provider type, isn't it?

On that abstract base class, perhaps yes. But it's highly likely that derived types will add properties for LoggerFactory and callbacks for client certificates or rotating passwords (e.g., Amazon IAM).

It also encourages people to manipulate them as strings when I think using the ConnectionStringBuilder class is a better idea. Is it worth considering getting a correctly typed connection string builder from the DbDataSource.

FWIW, DbConnectionStringBuilder had some performance issues last time I profiled it (albeit a few years ago): mysql-net/MySqlConnector#238. I'd personally prefer not to have it part of the DbDataSource API.

@FransBouma
Copy link
Contributor

Am I right in that the main purpose of this class is solely dependency injection support? As to me every data access library under the sun has implemented code which supports what this class covers, so not sure what core problems it is solving, besides DI related ones?

@bgrainger
Copy link
Contributor

A simple implementation (for discussion purposes) is in MySqlConnector:

@roji
Copy link
Member Author

roji commented Feb 7, 2022

Hey @FransBouma,

Am I right in that the main purpose of this class is solely dependency injection support? As to me every data access library under the sun has implemented code which supports what this class covers, so not sure what core problems it is solving, besides DI related ones?

Well, not really :) I tried to make the other arguments above, but here's an attempt at a much tighter summary:

  • DI is definitely a big target, but there are other non-DI applications where you need a connection factory.
  • Yes, data access libraries like EF Core and presumably LLBLGen Pro have typically already have a solution, but this also applications not using those libraries. When doing either raw ADO.NET or Dapper usage, this is an unsolved problem.
  • Beyond being a simple factory (which anyone can do), DbDataSource also aims to solve the ADO connection configuration problem: Microsoft.Extensions.Logging support, type mapper plugins, pre-configuration of prepared SQL statements (for all physical connections), and other similar stuff.
  • Finally, there's the aspect of executing commands directly against the DbDataSource, which would have 2 benefits:
    • Stop forcing users from dealing with DbConnection for state-less (e.g. transaction-less scenarios); this is a usability improvement for people doing raw ADO/Dapper (assuming Dapper implement extensions over DbDataSource)
    • Allow alternative command execution models which aren't tied to a specific connection (see Npgsql's multiplexing above).

So this isn't something that will directly help everyone, and end users using EF Core/LLBLGen Pro probably won't care, though it could still be important for the interaction between the data layer and the ADO provider (e.g. how the EF Core layer manages type mapper plugins in the ADO layer). Hope that helps explain the goals here...

@roji
Copy link
Member Author

roji commented Feb 7, 2022

@bgrainger

I'm not yet convinced that this is a good API. For symmetry with other DbProviderFactory methods, I would expect just CreateDataSource(). However, that implies that the DbDataSource.ConnectionString property would be mutable, which would be a bad idea; keeping DbDataSource immutable is good. Would it be better to standardise the DbDataSourceBuilder abstract base class, and have DbProviderFactory.CreateDataSourceBuilder()?

That's a good question. I think that symmetry with other DbProviderFactory methods (i.e. parameterless methods) isn't necessarily very important, whereas I agree the immutability of DbDataSource (at least with respect to the connection string) is critical (otherwise what happens when the connection string is modified after connections have been handed out etc.). It so happens that existing ADO.NET abstractions (connection, command) are mutable, whereas we want the one we're introducing to be immutable.

I think a DbDataSourceBuilder abstraction would have little value.. The only connection configuration detail that's cross-database is the connection string; once you want to touch anything else you have to code against provider-specific API anyway - and at that point there's no real reason to use DbProviderFactory (just new up your provider's DbDataSource builder, or whatever mechanism they have for creating a DbDataSource). In other words, if we had DbProviderFactory.CreateDataSourceBuilder, the only thing you can do with the builder it returns is to set the connection string, and then get the DbDataSource; may as well just provide the connection string directly on the DbProviderFactory method, no?

Secondly, the fact that a connection string is the only way to customise the DbDataSource (through this API) seems to imply that most/all useful properties can be set through the connection string, which this proposal already admits is a weakness of the current ADO.NET design.

Sure, though I think this a somewhat inherent limitation of wanting to write database-agnostic code, and at the same time tweak provider-specific settings. Doing this via the connection string may work well enough for simple settings (configure an int/bool thing), but becomes problematic for more complex stuff (register SSL certificate validation code, set an ILoggerFactory...). Do you have other ideas here?

Finally, it's hard to imagine a scenario where providerFactory.CreateDataSource(connectionString) provides any concrete benefit over var connection = providerFactory.CreateConnection(); connection.ConnectionString = connectionString;. (And the latter code would end up being written anyway, since it wouldn't require .NET 7.0.) Am I missing something?

Well, first, the DbDataSource could still be e.g. registered once in DI, whereas if you're creating connections and setting connection strings on them every time, you have multiple parts of your program needing to reference the connection string etc.

Second, I agree that for programs which don't ever want to reference any provider-specific code, this would be of somewhat limited use; those programs would generally not be able to configure SSL callbacks or other complex provider-specific things. However, there's also another model: programs which are OK with referencing provider features for setting up the DbDataSource (via a provider-specific builder), but once setup is complete, the DbDataSource is there and can be consumed via database-agnostic code. So I see the value here less in the setup (which has to be provider-specific the moment you want to do something complex), and more in the consumption of a DbDataSource that has already been set up.

Re .NET 7.0, are you saying that this wouldn't be useful since programs would still need code that works on older versions of .NET? If so, is that objection specific to this proposal somehow? I mean, it can be said against almost any new feature in .NET 7.0 that it wouldn't be useful since programs need to target older .NET versions... Which I'm not sure is a strong argument, since end programs can typically target the latest .NET 7.0 if they want to. Maybe I've misunderstood you here though.

public abstract string ConnectionString { get; } // TODO: Nullable?

This should be non-null. [...]

Agreed, that's also in line with how the property is null-annotated on DbConnection (for reading).

If DbDataSource is 1-1 with a connection pool (which seems like a typical, simple implementation) it may be worth explicitly clarifying that there is no requirement that new ExampleConnection(connectionString) and new ExampleDataSource(connectionString) share the same physical pool.

That's an important point indeed, thanks. As you wrote, if I instantiate two DbDataSources via new, typically that would mean two separate connection pools even if they have the same connection string (this could be an advantage in some cases). And this is indeed not the case with DbDataSources acquired from the shim, since in that case there's only one pool for a given connection string. I'll add a note for that above.

This actually seems a little confusing to me; coming to the API with fresh eyes, I would expect the DataSource property to be a way to get back to the DbDataSource that created the connection, i.e., Assert.Equal(dataSource, dataSource.GetConnection().DataSource). (Obviously the existing property can't be changed.)

I agree - it's unfortunate. I almost proposed a different name to this feature because of this (e.g. DbDatabase, DbConnectionSource), but the names I came up with were inferior in various ways, and doing that just because of this property seemed like too much.

Should getting back to the DbDataSource from a connection be possible? One can imagine DB-independent code that, given a DbConnection, would like to create a second, “identical” connection. AFAIK, this is typically done by casting the DbConnection to ICloneable, for libraries that support that interface.

I think that the cloning concern should probably be handled separately, i.e. without necessarily tying it to DbDataSource (something like Clone is definitely an option). BTW another need is to clone a connection while changing a setting; in the Npgsql EF Core provider, for database health checking I clone an ADO connection but without pooling, to ensure that Open actually creates a new physical connection etc. So I have a provider-specific thing for that, but I'm not aware of anyone asking for this in general...

Apart from that, I can see how getting back to the DbDataSource could be useful (though not sure of concrete scenarios yet). But it would also not be possible to shim that without wrapping the connection, which I'd rather avoid unless there's a good reason to do so. But am open to ideas to the contrary!

@roji
Copy link
Member Author

roji commented Feb 7, 2022

@Wraith2

If DbDataSource is immutable it's really just a type that encapsulates the two properties on a connection string config setting, the value and the provider type, isn't it? I think I've mentioned before that I'm not a fan of passing connection strings around because they're missing the really important information about the provider. It also encourages people to manipulate them as strings when i think using the ConnectionStringBuilder class is a better idea. Is it worth considering getting a correctly types connection string builder from the DbDataSource?

Can you give a concrete example scenario for this? Since DbDataSource needs to be immutable, I assume you're describing a case where some application code needs to know whether e.g. the connection string for some DbDataSource contains some option or not?

For that specific scenario (which seems quite rare, at least I haven't seen requests for it)... In theory I could see the use of exposing a DbConnectionStringBuilder on DbDataSource. However, this would have to be immutable too, since changing that DbConnectionStringBuilder cannot impact the DbDataSource; and DbConnectionStringBuilder is a completely mutable class. Note also that DbDataSource doesn't make this better or worse - people already have to do deal with this today, where a connection string is externally-specified. If they need to, they can always parse the connection string themselves (by constructing a DbConnectionStringBuilder), and then check the setting etc.

Hopefully I've understood your question...

It might be valuable to provide extensions to Microsoft.Extensions configuration readers so that DbDataSources can be created by users directly from config files rather than making them resolve the types themselves. So provide a DbDataSource dataSource = config.GetDbDataSource("name"); and then users can pass that around.

As @bgrainger wrote, the idea is also for providers to have their own DbDataSource implementations which support more than just connection strings (e.g. SSL callbacks), and it would be impossible to specify those in config (since .NET code is involved). But if someone does want to simply instantiate a DbDataSource out of a simple connection string, it should be very trivial to do that via the proposed DbProviderFactory.CreateDataSource method.

@bgrainger
Copy link
Contributor

Re .NET 7.0, are you saying that this wouldn't be useful since programs would still need code that works on older versions of .NET? If so, is that objection specific to this proposal somehow?

That was indeed a weak argument; please ignore it. 😀

@lauxjpn
Copy link

lauxjpn commented Feb 8, 2022

I like the idea of having an object that encapsulates all database specific (in contrast to provider/driver specific) settings and thereby functioning as a factory for a concrete database/configuration pair.


While this provides little additional benefit to users of providers that are already able to serialize/deserialize all their settings to/from a connection string, I do see the upside for providers that have extended configuration options that cannot be easily represented in a connection string.


Implicitly using this type as the boundary for connection pools is neat.


Being able to execute some simple SQL statements without much setup (and multiple using statements) would be nice.
While I am not convinced that this proposal fully solves this, it would still be simpler to use than what we currently have (assuming people easily discover this type).


The docs/intellisense would need to be very specific about whether an object returned from an Open...(), Get...() or Create...() method is expected to be disposed by the caller or not (mentioning the using statement).


It makes sense that using this abstraction should result in a slight performance boost. While this might not be significant for 98% of the userbase, it could be for high performance apps (and of course benchmarks, but benchmarks alone are probably not a viable argument to introduce another abstraction).

@roji Are there any preliminary benchmark results?


BTW, is this kind of encapsulation typical in other stacks/frameworks/languages and if so what type names do they use?

@roji
Copy link
Member Author

roji commented Feb 8, 2022

Thanks for the feedback @lauxjpn.

The docs/intellisense would need to be very specific about whether an object returned from an Open...(), Get...() or Create...() method is expected to be disposed by the caller or not (mentioning the using statement).

Agreed - and any disposable object returned from these any of this methods should be disposed as usual.

While I am not convinced that this proposal fully solves this, it would still be simpler to use than what we currently have (assuming people easily discover this type).

Sure. What do you see as still lacking here, any suggestions?

@roji Are there any preliminary benchmark results?

Not yet, and I agree the perf improvement is unlikely to be very significant. I see the perf improvement more as a nice side-benefit than a primary reason for doing this.

BTW, is this kind of encapsulation typical in other stacks/frameworks/languages and if so what type names do they use?

A big inspiration for this proposal came from the JDBC DataSource type, which is similar both in design and in naming. JDBC drivers are expected to implement their own DataSource, which doesn't do pooling; database-agnostic pooling packages (e.g. HikariCP, c3p0...) provide their own DataSource implementation overlayed upon the driver's, adding pooling (see notes above about how we could go in this direction, though I'm not sure we should).

But JDBC's DataSource is purely a connection factory (API docs). Various other languages/drivers also have a way to just "execute SQL" on a database without getting a connection first, unlocking stateless execution models as I described above. If you're interested I can dig around and show an example or two.

@lauxjpn
Copy link

lauxjpn commented Feb 8, 2022

While I am not convinced that this proposal fully solves this, it would still be simpler to use than what we currently have (assuming people easily discover this type).

Sure. What do you see as still lacking here, any suggestions?

@roji I think the remaining issue is with the existing command/data reader model. It feels just too clunky for simple data access in 2022. And whenever someone is supposed to use a using statement (e.g. for a DbCommand or DbDataReader object), it's implicitly not simple anymore.

Something like the following would be nice:

// returns something like IEnumerable<IRow>
var horsies = dataSource.ExecuteSql($"SELECT * FROM Horsies WHERE IWouldMuchRatherBeAUnicorn = {someCondition}");  

foreach (var horsie in horsies)
{
    var horsieName = horsie.GetValue<string>("Name");

    // ...
}

Some stuff would need to be figured out in regards to automatic resource disposal.

For example, query associated resources could be disposed when the parent data source gets disposed or the next ExecuteSql-like call runs on the same thread (query resources could be internally managed and reused on a thread level to avoid using statements).

Or maybe it would be simpler to manage the database resources with the IEnumerator<T>, which inherits from IDisposable.

(I have not thought through all the implications here.)


The docs/intellisense would need to be very specific about whether an object returned from an Open...(), Get...() or Create...() method is expected to be disposed by the caller or not (mentioning the using statement).

Agreed - and any disposable object returned from these any of this methods should be disposed as usual.

I think it is long overdue (unless it already exists and I missed it) to introduce a general .NET runtime attribute that can be used to annotate a method with the intended action a user is supposed to take, once the app is done with the returned IDisposable object.

(Or maybe simpler: the attribute marks those methods that return objects that should usually not be disposed by the caller.)

The IDE's IntelliSense could then help generating/fixing the caller code (e.g. adding/suggesting a using statement where it makes sense) and a static roslyn analyzer could show warnings or notices.

Because basically every time I don't new-up an IDisposable object myself, but instead get one from a method call, I have to stop what I am doing and do some research on what I am supposed to do in this particular case (or drill down into the implementation to infer what is expected of me).

@roji
Copy link
Member Author

roji commented Feb 8, 2022

Something like the following would be nice:

Ah, I see - this is the kind of proposal pushes ADO.NET more in the direction of e.g. Dapper. I don't think that's a bad idea, but I specifically tried to avoid going there, at least in this particular proposal. In other words, we can always suggest something like the above, both on DbConnection and on DbDataSource - I see this proposal as largely orthogonal to that.

Specifically to your proposal, yeah, there's lots to figure out - like how to generate parameter placeholders (i.e. #25022). Most importantly, for the above to be safe against SQL injection, it would have to use FormattableString; we've had APIs like that in EF Core and they've been problematic at least in some aspects. But I'd rather we kept that specific conversation separate, just so that we can get the above through without having to deal with the additional complexity.

I think it is long overdue (unless it already exists and I missed it) to introduce a general .NET runtime attribute that can be used to annotate a method with the intended action a user is supposed to take, once the app is done with the returned IDisposable object.

Isn't that what the IDisposable/IAsyncDisposable interfaces already do? I get your point that when you're not newing up yourself, there may be an ambiguity as to if the API handing you the IDisposable already disposes, but at least in this case it's a relatively straightforward factory API, so the fact that it's the user's responsibility to dispose seems pretty clear (though I agree it should be documented).

@lauxjpn
Copy link

lauxjpn commented Feb 8, 2022

I see this proposal as largely orthogonal to that.

I agree. And I think your proposal takes a first step into that general direction. Other functionality can always be added later.

As for my example code, it is really just a blend of a couple of things that would work nicely together. Not needing to use explicit parameters would be nice, but even just being able to execute some SQL without that tedious DbDataReader code and with as little resource management as possible would be a huge improvement.

I think that access to a general API for simple query scenarios would line-up nicely with the recent push for simpler code (like the new console app templates etc.) à la Python.

But that is not really part of this proposal and could be handled/discussed in a different one, once this one made it through.


Isn't that what the IDisposable/IAsyncDisposable interfaces already do? I get your point that when you're not newing up yourself, there may be an ambiguity as to if the API handing you the IDisposable already disposes, but at least in this case it's a relatively straightforward factory API, so the fact that it's the user's responsibility to dispose seems pretty clear (though I agree it should be documented).

Yeah, my comment is really only about the ambiguity when it comes to who should dispose a disposable object. The IDisposable/IAsyncDisposable interfaces by themselves are fine for their intended purpose.

A random ambiguity examples is RelationalDatabaseFacadeExtensions.GetDbConnection(DatabaseFacade) in EF Core. Though nowadays, it is properly documented that you usually should not dispose the returned object yourself.

Anyway, this ambiguity issue and a related attribute as a possible solution to it would need to be handled in a different proposal as well and is not specific to System.Data.

@roji
Copy link
Member Author

roji commented Feb 8, 2022

I think that access to a general API for simple query scenarios would line-up nicely with the recent push for simpler code (like the new console app templates etc.) à la Python.

I completely agree. Adding usability improvements to ADO.NET has been a long-standing discussion (not everyone agrees 😉), but I can indeed see a path forward where the raw database access API is something that's nice and simple (which it is definitely not today).

But that is not really part of this proposal and could be handled/discussed in a different one, once this one made it through.

👍

@roji
Copy link
Member Author

roji commented Feb 20, 2022

FYI will wait on this a little longer in case people have any further feedback, and then I'll flag this as ready-for-review (and flesh out the precise API and a possible implementation for the shim).

@Anderman
Copy link

Anderman commented Mar 9, 2022

Interesting. In my TDS Implement I also skipped the command and execute a Sql string directly from the connection. I think mapping to Poco or some kind of datareader /writer is possible.

When reading the columns from the TDS stream a user defined expression / function can be called to the mapping to poco or datareader. I did a hardcode poco implementation but that not necessary. A row and Columnreader is implemented here and here

@Anderman
Copy link

Anderman commented Mar 9, 2022

A columnreader give you also the possibility to a conversion / streaming before the complete row is read. A big jsoncolomn can parsed before creating a big string. Also images etc can be streamed to the client before reading a complete row

@roji
Copy link
Member Author

roji commented Mar 9, 2022

Interesting. In my TDS Implement I also skipped the command and execute a Sql string directly from the connection.

This proposal isn't about skipping the command - it's about skipping the connection. You're still executing a DbCommand, you'd just be doing it directly against a DbDataSource instead of against a DbCommand. As such, you'd still be using DbDataReader to consume the results - mapping to POCO is definitely out of scope here (and probably not a great fit for a low-level database API like ADO.NET).

A columnreader give you also the possibility to a conversion / streaming before the complete row is read. A big jsoncolomn can parsed before creating a big string. Also images etc can be streamed to the client before reading a complete row

Streaming column values without reading the complete row is already possible in ADO.NET via CommandBehavior.SequentialAccess. This allows you to get a readable Stream or TextReader for a big column, and so the entire row never needs to be buffered in memory etc.

@roji roji added api-ready-for-review API is ready for review, it is NOT ready for implementation blocking Marks issues that we want to fast track in order to unblock other important work labels May 2, 2022
@terrajobst
Copy link
Member

terrajobst commented May 5, 2022

Video

  • Should use the Dispose pattern with Dispose(bool) being virtual while the others aren't
    • And same for DisposeAsync()
  • Maybe GetConnection() should be called CreateConnection() because it returns a new instance
  • Consider introducing a term that represents connection who logically don't have a connection and rename CreateCommand()
    • For example CreateLightweightCommand()
    • The key is: consumers should be aware that this method isn't a convenience for OpenConnection().CreateCommand().
public abstract class DbDataSource : IDisposable, IAsyncDisposable
{
    public DbConnection GetConnection();
    public DbConnection OpenConnection();
    public ValueTask<DbConnection> OpenConnectionAsync();

    public DbCommand CreateCommand(string? commandText = null);
    public DbBatch CreateBatch();

    public virtual void Dispose();
    public virtual ValueTask DisposeAsync();

    public abstract string ConnectionString { get; }

    protected abstract DbConnection GetDbConnection();
    protected virtual DbConnection OpenDbConnection();
    protected virtual async ValueTask<DbConnection> OpenDbConnectionAsync(CancellationToken cancellationToken = default);

    protected virtual DbCommand CreateDbCommand(string? commandText = null);
    protected virtual DbBatch CreateDbBatch();
}

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels May 5, 2022
@GSPP
Copy link

GSPP commented May 6, 2022

Maybe GetConnection() should be called CreateConnection() because it returns a new instance

  1. I agree with that. In the same spirit, rename OpenConnection to OpenNewConnection.

  2. I wonder if exposing the connection string is wise.
    a. Maybe some provider doesn't have one. There could be multiple, for example a round-robin provider targeting different servers each time. Or, it's a provider that doesn't contain the notion of a string as the source of configuration.
    b. What's the purpose of exposing the connection string? Can a consumer not track the connection string separately if they need access to it?
    c. A consumer is supposed to treat the data source as opaque, but a connection string is always provider specific. It cannot be interpreted without provider knowledge.
    d. On the other hand, it's a nice debugging feature.

  3. What is the ToString representation? This is important for debugging. I suggest: $"{GetType().Name}: '{ConnectionString}'". And with that, I conclude that the ConnectionString property is important to have as a debugging feature.

@Wraith2
Copy link
Contributor

Wraith2 commented May 6, 2022

Or, it's a provider that doesn't contain the notion of a string as the source of configuration.

There aren't any. Unfortunately the ADO object model forces all providers to use a connection string as the serialized form of the connection specific settings. Even if a new api were added at this point to keep the data in another way there is a 20 year legacy of using strings so we need to provide a way to get the string from it.

What's the purpose of exposing the connection string? Can a consumer not track the connection string separately if they need access to it?

It's one of the the primary pieces of information that this api is wrapping in a useful way. You can't add the connection string to DI as the string type. This type provides users (end users and code users) with two important pieces of information together, the provider factory information and the connection string that will be needed by that provider factory.

@roji
Copy link
Member Author

roji commented May 6, 2022

I wonder if exposing the connection string is wise. [...] Maybe some provider doesn't have one. There could be multiple, [...]

It's an interesting question.. As @Wraith2 says, the connection string is already baked into the ADO.NET API - DbConnection has it, etc. So at least in that sense we're in line with the rest of the API.

But a DbDataSource encompassing multiple databases (and therefore connection strings) is an important notion - there could definitely be a layering DbDataSource that accepts other, primitive DbDataSources. It's true that there wouldn't be a meaningful connection string to expose, but it's not the end of the world either - an empty string doesn't seem too problematic for this case.

I can't work up a very strong argument either way - exposing the connection string seems like it could be useful in some cases (like debugging), and for the few cases where it isn't relevant, it can be left unset... Any additional thoughts?

This type provides users (end users and code users) with two important pieces of information together, the provider factory information and the connection string that will be needed by that provider factory.

Note that as of now, the API doesn't expose a DbProviderFactory (but is itself a factory for connections). The idea is indeed for DbDataSource to abstract a database (which typically means connection string + driver), but not necessarily to provide users access to those details (as opposed to opening a connection). Also, @GSPP's comment about a layered DbDataSource holds true for DbProviderFactory as well - you could imagine a data source that hands out connections to different databases using different providers (though I admit that's a bit far-fetched).

@roji roji removed the blocking Marks issues that we want to fast track in order to unblock other important work label May 16, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 31, 2022
roji added a commit to roji/runtime that referenced this issue Jun 1, 2022
@roji
Copy link
Member Author

roji commented Jun 1, 2022

Everyone, a PR implementing DbDataSource is out: #70006. I'd appreciate a thorough review from anyone who's interested.

With regards to the API review comments (#64812 (comment)):

Maybe GetConnection() should be called CreateConnection() because it returns a new instance

  • After some thought and discussion, the PR does change the naming from GetConnection to CreateConnection, but leaves OpenConnection as-is. I don't think it's perfect and am open to discussion.
  • GetConnection could confuse some people, making them think there's only one connection (though there are many counter-examples, such as IEnumearble.GetEnumerator). CreateConnection avoids that, and it feels like OpenConnection is less susceptible to create that confusion.
  • Create is the prefix used in other places in ADO.NET (DbConnection.CreateCommand, DbProviderFactory), so we're consistent with that.
  • The API will indeed typically return new instances of DbConnection; but in pooling scenarios DbConnection is a façade which does not represent a physical pool (it's a lightweight object that can be closed and re-opened). So I think it's important to avoid naming such as OpenNewConnection which would be too heavy in the pooling context (which is the vast majority of cases).
  • Consuming users are expected to almost always use OpenConnection, and should only rarely need to get an unopened connection (i.e. when they need to do something to the connection object before opening it). So it's important to keep the OpenConnection API concise and easy - it's the primary API point for this new abstraction.

Consider introducing a term that represents connection who logically don't have a connection and rename CreateCommand()

  • The issue is that the instance returned from CreateCommand doesn't allow accessing the Connection and Transaction properties (since it owns the connection and manages it internally).
  • After thinking about it, I currently think leaving the name CreateCommand is the right thing to do:
    • Manipulating these two properties is only a small part of the command API surface - execution is the main function, and that behaves just like a normal command. This command doesn't behave fundamentally differently from a regular command - it just has two disabled properties.
    • There are already command states in which e.g. mutating Connection/Transaction shouldn't work, e.g. changing them while the command is in progress and a reader is open.
    • Introducing a new special term would add more confusion than help; an informative NotSupportedException from these properties should be sufficient.
    • Once again, CreateCommand is a main user-facing API on DbDataSource, so it's good to keep it concise and not overload it because of a restriction that won't matter in most cases.

Note: a WIP implementation in Npgsql can be found here.

@Wraith2
Copy link
Contributor

Wraith2 commented Jun 1, 2022

I read through the changes last night and had nothing to add or ask. Looks good to me.

@ithline
Copy link

ithline commented Jun 1, 2022

If the only difference between OpenConnection and CreateConnection is that the former returns already open connection, couldn't both methods be merged together with optional parameter specifying if the connection should return already open?

@terrajobst
Copy link
Member

terrajobst commented Jun 1, 2022

  • Manipulating these two properties is only a small part of the command API surface - execution is the main function, and that behaves just like a normal command. This command doesn't behave fundamentally differently from a regular command - it just has two disabled properties.

If you believe that's the right thing, I can accept that. From my own experience I'd say I have rarely (if ever) used the Connection property on a Command object; having that throw would largely be unobserved. Using transactions with commands is very common though, but the way ADO.NET works is that one usually initiates the transaction before creating the command and then passes in the transaction into the command. So given that these APIs don't accept one, that's probably not causing any confusion either.

it just has two disabled properties

Looking at the API it seems both Connection and Transaction are nullable propertieso on DbCommand. So if you say disabled I'd think it would mean that they return null and would only throw when someone tries to assign to them, is that correct?

FWIW, I think you still want to come up with a name for these kind of commands, not for the API itself, but I think this will be a very useful thing for the documentation in order to describe what these kind of commands can go. Having a concept also allows you to refer to them when throwing exceptions, e.g. "You cannot set a connection on connection-less commands". People can then search for "connection-less commands" to learn what these are and how they work. Otherwise you have to refer to them in a convoluted way, like "You cannot set a connection on commands created by DbDataSource.CreateCommand()", which might also be confusing when people get these command objects from a different place, e.g. SqlDataSource.CreateCommand() or MyBackendThingie.Command.

@roji
Copy link
Member Author

roji commented Jun 2, 2022

@Wraith2 thanks for taking the time to review!

@ithline collapsing OpenConnection and Createconnection into a single method is possible, however:

  • Providers will generally only need to override CreateConnection: OpenConnection simply calls that and adds the open. So if we collapse, all providers would need to reimplement the opening as well.
  • Users are expected to only use OpenConnection in the vast majority of cases; the only reason why they'd need an un-opened connection is if they need to tweak the connection in some way before opening it (this generally shouldn't happen much). So in their interaction with the main API, I'd rather they didn't have a bool parameter they'd need to understand and think about - just a simple OpenConnection seems best.

@roji
Copy link
Member Author

roji commented Jun 2, 2022

Looking at the API it seems both Connection and Transaction are nullable propertieso on DbCommand. So if you say disabled I'd think it would mean that they return null and would only throw when someone tries to assign to them, is that correct?

That's a good point - we can indeed make them just always return null when read (the getter currently also throws in the PR); this may unblock certain scenarios where code inspects these properties for some reason. For Connection specifically that may not be a good idea, since in the normal case a command with no connection isn't executable, but in this case it is; so code making standard assumptions would fail here, and a fail-fast exception from the getter may be better (for transaction it seems more OK). What do you think?

FWIW, I think you still want to come up with a name for these kind of commands, not for the API itself, but I think this will be a very useful thing for the documentation in order to describe what these kind of commands can go.

You're right - in my head I've been calling these "data source commands", though "connection-less commands" is probably better. Will keep this in mind for the docs.

@ithline
Copy link

ithline commented Jun 2, 2022

@roji my thought process was that OpenConnection() and CreateConnection() should mimick CreateCommand(string? commandText = null).

If the public signature would be public DbConnection CreateConnection(bool createOpen = true), then users could call just dataSource.CreateConnection() and when they would need closed connection, then they would pass false value.

From provider implementation it is a bit more work, but it would result in a cleaner public API, mainly because OpenConnection does not communicate very well (at least to me) that the method returns a new instance.

@roji
Copy link
Member Author

roji commented Jun 2, 2022

The thing is that almost nobody needs to get an un-opened connection, but everyone would be "burdened" with understanding what the optional bool parameter is about (e.g. Intellisense). This is very different from the situation with DbCommand and commandText, since every command requires a commandText in order to work. I basically want to separate/hide away the unopen scenario because very few users need it.

[...] mainly because OpenConnection does not communicate very well (at least to me) that the method returns a new instance.

I get that, though I personally don't feel that OpenConnection conveys that either way (it can be read as "open the connection" or "open a connection"); FWIW I felt the same way about GetConnection - after all, we have IEnumerable.GetEnumerator in .NET which returns a new instance.

roji added a commit that referenced this issue Jun 6, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jun 6, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jul 6, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Data
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants