Add a fast and convenient deserialization API to ADO.NET #25739

@GSPP

Description

When using ADO.NET to read rows, the deserialization cost can be very significant. This matters in scenarios where a lot of data is streamed. Here is directly executable code to demonstrate the problem:

    using System;
    using System.Data;
    using System.Data.SqlClient;

    static class Program
    {
        static void Main()
        {
            DeserializationBenchmark.Run();
        }

        public static class DeserializationBenchmark
        {
            public static void Run()
            {
                using (var connection = new SqlConnection("Data Source=(local); Integrated Security=True; Initial Catalog=master;"))
                {
                    connection.Open();

                    PrepareTestData(connection);

                    RunBenchmarkLoop();
                }
            }

            static void PrepareTestData(SqlConnection connection)
            {
                using (var command = new SqlCommand(@"SELECT o1.object_id value1, o1.object_id value2, o1.object_id value3, o1.object_id value4, o1.object_id value5 INTO ##BenchmarkRows FROM sys.all_objects o1 cross join (select top 100 * from sys.all_objects o2) o2", connection))
                {
                    command.ExecuteNonQuery();
                }
            }

            static void RunBenchmarkLoop()
            {
                long count = 0;

                while (true)
                {
                    ReadData();

                    count++;

                    if (count % 10 == 0)
                        Console.WriteLine($"Completed {count} iterations.");
                }
            }

            static void ReadData()
            {
                using (var connection = new SqlConnection("Data Source=(local); Integrated Security=True; Initial Catalog=master;"))
                {
                    connection.Open();

                    using (var transaction = connection.BeginTransaction(IsolationLevel.ReadCommitted))
                    {
                        using (var command = new SqlCommand(@"SELECT * FROM ##BenchmarkRows", connection, transaction))
                        using (var reader = command.ExecuteReader())
                        {
                            while (reader.Read())
                            {
                                var value1 = reader.GetInt32(0);
                                var value2 = reader.GetInt32(1);
                                var value3 = reader.GetInt32(2);
                                var value4 = reader.GetInt32(3);
                                var value5 = reader.GetInt32(4);
                            }
                        }

                        transaction.Commit();
                    }
                }
            }
        }
    }

This code simply deserializes a table of a few thousand rows with five int columns. The benchmark is designed to demonstrate the issue, but it is not unrealistic, and it follows ADO.NET best practices.

[image: profiler results]

I might be mistaken, but it seems that a lot of the time is spent calling chatty APIs on SqlDataReader: there is a lot of API overhead and comparatively little work actually parsing the network protocol. The console process uses 100% of one CPU core while SQL Server uses only about one third of a core. This is further (heuristic) evidence that the client is less efficient than the server.

There is no way to avoid this cost at the moment. The code is as fast as it gets. I suggest the following enhancement: Provide an API that directly deserializes into a user-provided type.

Old:

    IDataReader reader = command.ExecuteReader();
    while (reader.Read())
    {
        // Read all fields individually
    }

New:

    IEnumerable<MyDTO> resultRows = command.ExecuteObjects<MyDTO>();
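
For concreteness, MyDTO could be a plain DTO matching the five int columns of the benchmark table (a hypothetical type used only for illustration, not part of the proposed API):

    public class MyDTO
    {
        // One property per column of ##BenchmarkRows.
        public int Value1 { get; set; }
        public int Value2 { get; set; }
        public int Value3 { get; set; }
        public int Value4 { get; set; }
        public int Value5 { get; set; }
    }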

This API could be made extremely fast because:

  • It is less chatty. ADO.NET can access its internal buffers more directly.
  • Metadata overhead has to be incurred only once for the whole result set. Many per-row and per-field metadata accesses and validations can be eliminated.
  • ADO.NET can compile a lambda expression dynamically that pulls data from the internal byte buffers and pushes it into the user-provided type. To illustrate the point, the expression could look like this: new MyDTO() { Field1 = internalReader.GetNextFieldAsInt32(), Field2 = internalReader.GetNextFieldAsString(), ... }. (A sketch of this compilation technique follows this list.)
  • The lambda can be cached internally. Alternatively an API consumer would use a factory to create a "fast reader" and cache the reader.
  • The entire row can be read in one go. I believe this makes it faster to parse the TDS protocol.
  • If async is in use there will be less async overhead because the APIs are less chatty.
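
To make the compiled-lambda point concrete, here is a minimal sketch of the technique written against today's public SqlDataReader API. It cannot reach the internal buffers, so it only eliminates per-row reflection and hand-written mapping code, but the proposed API would generate similar code directly against ADO.NET internals. RowMaterializer and Compile are hypothetical names, and the sketch assumes all columns are int:

    using System;
    using System.Data.SqlClient;
    using System.Linq.Expressions;
    using System.Reflection;

    static class RowMaterializer
    {
        // Compiles reader => new T { Prop0 = reader.GetInt32(0), ... } once,
        // so each subsequent row costs a single delegate invocation.
        public static Func<SqlDataReader, T> Compile<T>(params string[] propertyNames)
            where T : new()
        {
            ParameterExpression reader = Expression.Parameter(typeof(SqlDataReader), "reader");
            MethodInfo getInt32 = typeof(SqlDataReader).GetMethod(nameof(SqlDataReader.GetInt32));

            var bindings = new MemberBinding[propertyNames.Length];
            for (int i = 0; i < propertyNames.Length; i++)
            {
                // Bind property i to the expression reader.GetInt32(i).
                bindings[i] = Expression.Bind(
                    typeof(T).GetProperty(propertyNames[i]),
                    Expression.Call(reader, getInt32, Expression.Constant(i)));
            }

            Expression body = Expression.MemberInit(Expression.New(typeof(T)), bindings);
            return Expression.Lambda<Func<SqlDataReader, T>>(body, reader).Compile();
        }
    }

The compiled delegate would be cached and then used in the read loop of the benchmark:

    var materialize = RowMaterializer.Compile<MyDTO>("Value1", "Value2", "Value3", "Value4", "Value5");

    while (reader.Read())
    {
        MyDTO row = materialize(reader);
    }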

Possible extensions of this idea would be:

  • Allow structs as well as classes
  • Allow invoking a constructor
  • Allow sequences of the form IEnumerable<object[]> where all field values are provided as objects.
  • Add a pull-style API. Very rough API idea: readerObject.GetNextObject<MyDTO>().
  • Add a way to provide and reuse preallocated objects e.g. readerObject.GetNextObject<MyDTO>(existingObject).
  • Async support (IAsyncEnumerable or other suitable mechanisms).
  • Maybe there can be a parallel mode that hands out a ParallelQuery. I'm not optimistic this can be done.
  • Allow sequences of the form IEnumerable<MyDTO[]>. Here, ADO.NET would hand out rows in chunks. The chunk size would be chosen optimally. Probably, ADO.NET would convert each network buffer into a chunk. This would make the API even less chatty.
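
Straw-man shapes for several of these extensions, purely hypothetical:

    public interface IFastReader<T>
    {
        T GetNextObject();                 // pull-style API
        T GetNextObject(T existingObject); // reuse a preallocated object
        IEnumerable<T[]> ReadChunks();     // hand out rows in chunks, e.g. one per network buffer
    }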

I understand that this proposal is a bit raw; the API I proposed is a straw man that clearly has some flaws. Still, I think an API in this spirit could improve efficiency a lot. It would also be very convenient: the existing pattern is extremely awkward to use, and people already rely on lightweight wrappers such as Dapper whose APIs look just like this proposal.


The API should be designed so that ORMs such as Entity Framework can use it as well. This might not be trivial, depending on what Entity Framework does internally. Here is an idea for how to support arbitrary object instantiation:

Instead of making ADO.NET figure out how to construct the new MyDTO(...) part of the expression tree, let the user provide that part. ADO.NET supplies one Expression for each field to be deserialized. It could look like this:

    var fastReader = FastReader.Create<MyDTO>(
        columnDescriptions, // contains the type and order of columns
        (Expression[] columnValueExpressions) =>
            Expression.New(
                // assumes MyDTO has a constructor taking one argument per column
                typeof(MyDTO).GetConstructor(Array.ConvertAll(columnValueExpressions, e => e.Type)),
                columnValueExpressions));

    // cache fastReader

    IEnumerable<MyDTO> objects = fastReader.Read(mySqlDataReader);

In this example, the API caller can create an arbitrary deserialization expression that uses the ADO.NET-provided column values in columnValueExpressions; columnValueExpressions[0] would correspond to internalReader.GetNextFieldAsInt32() in the code further above. Entity Framework could construct an expression tree that does anything it wants, including creating multiple objects, calling factory methods and calling into the object state manager. All of this processing would happen very close to where the actual column values are obtained from internal ADO.NET data structures. No chattiness.
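
For example, with the property-based MyDTO sketched earlier (which has no multi-argument constructor), the callback could bind properties instead of invoking a constructor; the hypothetical FastReader API stays the same, only the user-built expression changes:

    // Hypothetical callback: each columnValueExpressions[i] stands for
    // "read column i from the internal buffers".
    (Expression[] columnValueExpressions) =>
        Expression.MemberInit(
            Expression.New(typeof(MyDTO)),
            Expression.Bind(typeof(MyDTO).GetProperty("Value1"), columnValueExpressions[0]),
            Expression.Bind(typeof(MyDTO).GetProperty("Value2"), columnValueExpressions[1]),
            Expression.Bind(typeof(MyDTO).GetProperty("Value3"), columnValueExpressions[2]),
            Expression.Bind(typeof(MyDTO).GetProperty("Value4"), columnValueExpressions[3]),
            Expression.Bind(typeof(MyDTO).GetProperty("Value5"), columnValueExpressions[4]))

An ORM would build a much richer expression here, but the shape is the same.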

Metadata


Labels

area-System.Data, enhancement (Product code improvement that does NOT require public API changes/additions), tenet-performance (Performance related issue)
