Strategy for inserting rows with table-valued parameters #2484

Open
GSPP opened this Issue Jun 27, 2015 · 9 comments


GSPP commented Jun 27, 2015

With SQL Server you can use table-valued parameters (TVPs) to perform bulk DML very quickly and elegantly. Unfortunately, this is tedious:

  1. Create a table type
  2. Write an INSERT statement
  3. Create a DataTable or an IEnumerable<SqlDataRecord> holding the rows
  4. Execute a command
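
The four steps above can be sketched in T-SQL; the table, type, and column names here are hypothetical:

```sql
-- Step 1: a table type mirroring a hypothetical target table
-- dbo.Customers(Id INT, Name NVARCHAR(50)).
CREATE TYPE dbo.CustomerRows AS TABLE
(
    Id   INT          NOT NULL,
    Name NVARCHAR(50) NULL
);
GO

-- Step 2: the INSERT that consumes the TVP. From ADO.NET (steps 3 and 4),
-- @rows is passed as a SqlParameter with SqlDbType.Structured, filled from
-- a DataTable or an IEnumerable<SqlDataRecord>.
INSERT INTO dbo.Customers (Id, Name)
SELECT Id, Name
FROM @rows;
```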

Here is a proposal for how EF could pull off all of this transparently for inserts and deletes in SaveChanges:

Generate a table type programmatically for all tables that require it. 3 issues:

  1. The table type probably should be created in a separate transaction so that concurrent SaveChanges calls do not contend. But really this is optional because type creation is a one-time initialization.
  2. A type name must be chosen and the type's structure might change over time. This can be solved by appending a strong hash of the type's structure to its name. That way table types are immutable: if the table structure changes, a new type is created. The hash would include column names, their order, data types, nullability and the primary key. Old TVP types are simply never cleaned up; they only arise on schema change, which is rare.
  3. This requires DDL permission. EF could check on startup for these permissions. Alternatively, the feature could be opt-in.
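
A sketch of the hash-suffixed naming scheme from point 2, using a hypothetical type name (the hash suffix here is made up):

```sql
-- The suffix is a strong hash of the column names, their order, data types,
-- nullability and the primary key. A schema change yields a new hash and
-- thus a new name, so existing types are never altered, only superseded.
-- CREATE TYPE must run in its own batch, hence the EXEC.
IF TYPE_ID(N'dbo.Customers_TVP_a3f91c27') IS NULL
    EXEC sp_executesql N'
        CREATE TYPE dbo.Customers_TVP_a3f91c27 AS TABLE
        (
            Id   INT          NOT NULL,
            Name NVARCHAR(50) NULL
        );';
```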

I think all of this would work for deletes as well. Updates are trickier because there is a great variety of columns that might change or not change. Maybe EF can use a single type for all updates and simply ignore some columns for some updates.

Non-issues:

  1. Generated values (identity, defaults). The OUTPUT clause can return them elegantly.
  2. Performance. For SaveChanges calls with few rows the existing row-by-row strategy should be used. Over the network the TVP strategy is probably better starting with 2-3 rows due to roundtrip times. On the same machine I measured the threshold to be 10 for a particular workload.
  3. Semantics. I don't think the semantics of a SaveChanges call would be affected in any way. It's simply a "go faster" feature.
  4. "Do we really need this given that SqlBulkCopy exists?": TVPs can broadly compete with SqlBulkCopy. They are a small integer factor slower, but roughly two orders of magnitude faster than row-by-row inserts, and two orders of magnitude go a long way. Often, this will be good enough. I think SqlBulkCopy usage will drop dramatically once this feature is available in EF.
  5. Topological ordering of DML. I don't see any issues here. Doing an entire table at once should always result in a valid topological order. Alternatively, EF could detect safe cases and fall back to row-by-row otherwise.
  6. Row order. Index uniqueness validation logically happens at the end of a DML statement. Therefore the order of rows in the TVP does not matter for correctness.
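
The OUTPUT clause from point 1 can be sketched as follows, again with hypothetical names; the generated identity values come back as a result set in the same round-trip, so EF could fix up entity keys after a bulk insert:

```sql
-- Hypothetical target: dbo.Customers(Id INT IDENTITY PRIMARY KEY,
-- Name NVARCHAR(50)). @rows is the TVP holding the new rows.
INSERT INTO dbo.Customers (Name)
OUTPUT inserted.Id          -- generated identity value, one per inserted row
SELECT Name
FROM @rows;
```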

Performance benefits:

  1. Many-row inserts are much more efficient because SQL Server can "see" all rows at once and maintain all indexes optimally (usually by writing to one index at a time in sorted order).
  2. Fewer round-trips. Over a network a round-trip might cost 0.5 ms.
  3. Less CPU load on both client and server.

Who says that EF is not suitable for bulk inserts? Right now that might be the case but it does not have to be so. This would be awesome.

The point of this ticket is to present a viable plan for how this could be implemented in EF. Some inspiration for the dev team.

ErikEJ commented Jun 27, 2015

Would make a great PR! Did you test the batch update/insert feature, that is already implemented?

divega commented Jun 28, 2015

@GSPP I agree with @ErikEJ, this sounds like a great PR. Currently EF Core supports batching, so things are already factored in a way that multiple operations of the same type are grouped and submitted to the server together, and that could make a TVP-based strategy easier to plug in. I would expect TVPs to have a perf advantage over what we do, but I don't know how much.

ErikEJ commented Jun 29, 2015

I think TVPs would only have a measurable perf advantage when inserting 100s or 1000s of rows (which is not what you would normally do with EF in OLTP systems).

GSPP commented Jun 29, 2015

@ErikEJ I understand why you might think that, but measurements turn out not to support this view.

I have meant this TVP feature to be used with "bulky" inserts (100-1M rows). It is not primarily targeted at OLTP inserts (dozens of rows) but the perf benefits are visible there as well. I can only encourage you to try it because the gains can be staggering.

I'd like to quote this piece:

  1. Many-row inserts are much more efficient because SQL Server can "see" all rows at once and maintain all indexes optimally (usually by writing to one index at a time in sorted order).

Do not underestimate the difference that a better plan can make. Per-index maintenance in bigger batches can be significantly faster than issuing new queries and round-trips all the time. Even without indexes, imagine a single-row insert (small row, no indexes): 90% of the execution time is per-statement and per-batch overhead, and the useful per-row work is small. With TVPs the per-statement and per-batch overheads are paid once for many rows (granted, they will be a little higher, but not by much).

I will not be able to issue a PR for this but I hope that this ticket might spark one!

roji commented Jun 29, 2015

Not directly related, but FYI: I was just thinking about a similar trick for PostgreSQL/Npgsql — switch to PostgreSQL's COPY (bulk data transfer) protocol for modification batches with many inserts.

ErikEJ commented Jun 29, 2015

@GSPP Sounds great. I have actually implemented this for doing INSERTs into a single table with 3-10 rows, and it works great. DDL permissions could be a showstopper, however.

ChristopherHaws commented Jan 11, 2018

I really like the idea of adding support for TVPs! As a smaller short-term change, it would be nice if there was a strategy for compressing the parameters within the current process. For example, if 2100 parameters are being passed but 90% of them are just null, maybe we should pass null as one parameter and reuse it.

For example, currently an insert might look like this:

exec sp_executesql N'SET NOCOUNT ON;
INSERT INTO [dbo].[MyTable] (Column1, Column2, Column3, Column4, Column5) VALUES
(@p0, @p1, @p2, @p3, @p4),
(@p5, @p6, @p7, @p8, @p9);
',N'@p0 int,@p1 nvarchar(50),@p2 nvarchar(50),@p3 nvarchar(50),@p4 nvarchar(50),
    @p5 int,@p6 nvarchar(50),@p7 nvarchar(50),@p8 nvarchar(50),@p9 nvarchar(50)',
@p0=1,@p1=NULL,@p2=NULL,@p3=NULL,@p4=NULL,
@p5=1,@p6=NULL,@p7=NULL,@p8=NULL,@p9=NULL

But with "compression" it would be:

exec sp_executesql N'SET NOCOUNT ON;
INSERT INTO [dbo].[MyTable] (Column1, Column2, Column3, Column4, Column5) VALUES
(@p0, @p1, @p1, @p1, @p1),
(@p0, @p1, @p1, @p1, @p1);
',N'@p0 int,@p1 nvarchar(50)',
@p0=1,@p1=NULL

For bulk inserts with many fields with the same values, this could make the save operation happen much faster as we could add more records per batch.

NetTecture commented May 3, 2018

Not only for NULL; you can also do this for other repeated values. Is it possible to see where a value comes from? I often join multiple tables with basically the same condition (owner, tenant) and get multiple parameters generated.

GSPP commented May 8, 2018

This is generally seen as bad practice, or at least risky, because it causes additional query plans to be generated.
