
Batch insert/update with Hibernate & HikariCP #758

Closed
cvaliere opened this issue Nov 2, 2016 · 12 comments

@cvaliere

cvaliere commented Nov 2, 2016

Hi everyone,

I'm using HikariCP 2.5.1 and Hibernate 5.2.3.

I'm trying to get Hibernate to batch my inserts/updates, but no matter what I try, I can't get it to work.

Here is my XML config:

<bean id="hikariConfig" class="com.zaxxer.hikari.HikariConfig">
	<property name="poolName" value="springHikariCP" />
	<property name="driverClassName" value="org.postgresql.Driver" />
	<property name="jdbcUrl" value="${database.jdbcUrl}" />
	<property name="username" value="${database.username}" />
	<property name="password" value="${database.password}" />
	<property name="maximumPoolSize" value="10" />
</bean>

<bean id="dataSource" class="com.zaxxer.hikari.HikariDataSource" destroy-method="close">
	<constructor-arg ref="hikariConfig" />
</bean>

<tx:annotation-driven transaction-manager="transactionManager" />
<bean id="transactionManager" class="org.springframework.orm.jpa.JpaTransactionManager">
	<property name="entityManagerFactory" ref="entityManagerFactory" />
</bean>

<bean id="entityManagerFactory" class="org.springframework.orm.jpa.LocalContainerEntityManagerFactoryBean">
	<property name="persistenceUnitName" value="persistenceUnit" />
	<property name="dataSource" ref="dataSource" />
	<property name="packagesToScan" value="com.myAwesomeCompany" />
	<property name="jpaVendorAdapter">
		<bean class="org.springframework.orm.jpa.vendor.HibernateJpaVendorAdapter" />
	</property>
	<property name="jpaProperties">
		<props>
			<prop key="hibernate.dialect">org.hibernate.dialect.PostgreSQLDialect</prop>
			<prop key="hibernate.show_sql">false</prop>
			<prop key="hibernate.format_sql">false</prop>
			<prop key="hibernate.hbm2ddl.auto">${hibernate.hbm2ddl.auto}</prop>

			<!-- bulk insert and update statements -->
			<prop key="hibernate.jdbc.batch_size">200</prop>
			<prop key="hibernate.order_inserts">true</prop>
			<prop key="hibernate.order_updates">true</prop>
			<prop key="hibernate.jdbc.batch_versioned_data">true</prop>

			<!-- jdbc driver hint for number of fetched rows for select statement -->
			<prop key="hibernate.jdbc.fetch_size">200</prop>

			<!-- improve app startup performance -->
			<prop key="hibernate.temp.use_jdbc_metadata_defaults">false</prop>
		</props>
	</property>
</bean>

My test case then simply creates 10 simple objects. I would expect a single batched statement, but I get 10 instead.

Apparently there is another way of setting things up (https://github.com/brettwooldridge/HikariCP/wiki/Hibernate4), but I can't figure out how it works.

Does someone know how to make it work?

Thanks a lot!

@jnehlmeier

Sounds more like you are using entities that are not compatible with batching. My guess is your entity uses an identity ID generator, e.g.

@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;

JDBC batching is silently disabled by JPA/Hibernate when this ID generation strategy is used, because it requires asking the DB for the auto-generated ID of each inserted row. You would need to use a sequence or table ID generator instead, so that JPA/Hibernate can allocate more than a single ID at once before actually executing the batch insert.

@cvaliere
Author

cvaliere commented Nov 2, 2016

@jnehlmeier thanks for the idea, but that's not it.
Here is my ID mapping:

@Id
@Column(name = "id")
@Access(AccessType.PROPERTY)
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "account_gen")
@SequenceGenerator(name = "account_gen", sequenceName = "account_seq", allocationSize = 1)
private Long id;

Actually, the simplest way to help me would be to share a working configuration :)
@brettwooldridge do you have an example of a working config?
There is this link (https://github.com/brettwooldridge/HikariCP/wiki/Hibernate4), but it's not a complete config, and I can't work out how it's supposed to work.

@brettwooldridge
Owner

@cvaliere I do not have a working configuration. However, if you are using MySQL, you will not get batching behavior even if the code uses the JDBC batching API unless the MySQL rewriteBatchedStatements property is set to true.

See http://stackoverflow.com/questions/26307760/mysql-and-jdbc-with-rewritebatchedstatements-true

@cvaliere
Author

cvaliere commented Nov 3, 2016

I am using PostgreSQL, not MySQL.
Can you leave this issue open, in case someone has a working config? I guess it would also be useful for other people.
@lburgazzoli if I understand correctly, you are the creator of the HikariCPConnectionProvider; do you have a complete working configuration for batching statements with it?
Sorry to bother you all, but I've searched for days and can't find any tutorial that works...

@lburgazzoli
Contributor

@cvaliere unfortunately no, I don't

@brettwooldridge
Owner

brettwooldridge commented Nov 3, 2016

@cvaliere Have you tried asking on the Hibernate forums? Are there any secondary tables involved? https://hibernate.atlassian.net/browse/HHH-5797

Have you tried setting autoCommit in HikariCP to false?
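In the Spring XML configuration shown earlier, that would be one extra property on the hikariConfig bean, something like this (a sketch, assuming HikariCP's autoCommit setting is exposed as a bean property like the others):

```
<property name="autoCommit" value="false" />
```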

What about org.hibernate.dialect.PostgreSQL9Dialect?

@brettwooldridge
Owner

This answer:

http://dba.stackexchange.com/questions/58155/optimize-postgresql-for-a-lot-of-inserts-and-bytea-updates

Says that PgJDBC doesn't do anything useful with batching, and just runs each statement...

@cvaliere
Author

cvaliere commented Nov 3, 2016

Actually, I may have misunderstood the concept of "batch inserts".
I was expecting that, if I insert 2 objects

insert into account values (1,'John');
insert into account values (2,'Jane');

batching them would transform it into a single multi-insert statement
insert into account values (1,'John'), (2, 'Jane');

but, from what I read on a lot of forums, "batching" does NOT do that; it just sends the 2 inserts in a single round-trip to the database, but still as 2 statements

and, on top of the batching, we can ask the JDBC driver to transform the batches into multi-insert statements, with rewriteBatchedStatements (MySQL) or reWriteBatchedInserts (PostgreSQL)

am I right?
if so, I'm not sure I understand the benefit of batching

@brettwooldridge
Owner

brettwooldridge commented Nov 4, 2016

@cvaliere Technically, JDBC defines two kinds of "batching". One is defined by the Statement API, which simply states:

Submits a batch of commands to the database for execution and if all commands execute successfully, returns an array of update counts.

Some drivers have the ability to optimize the query as you described, by coalescing the values into a single statement, i.e. "rewriting" the batched statements.

The second kind of batching is typically much more performant and requires the use of the PreparedStatement.

PreparedStatement ps = connection.prepareStatement("INSERT INTO account VALUES (?,?)");
ps.setInt(1, 1);
ps.setString(2, "John");
ps.addBatch();
ps.setInt(1, 2);
ps.setString(2, "Jane");
ps.addBatch();
...
ps.executeBatch();

Until recently, MySQL did not support server-side prepared statements, so this kind of batching couldn't work there.

Note that the PostgreSQL reWriteBatchedInserts driver property was just introduced in 9.4.1209, which was released on 15-Jul-2016. Here is the pull request discussion. Note that there are caveats in that some forms of INSERT can disable the batch rewriting, for example the use of the RETURNING keyword, or if the PreparedStatement is created with Statement.RETURN_GENERATED_KEYS.
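Enabling it is just a matter of appending the property to the JDBC URL, for example (hostname, port, and database name here are placeholders):

```
jdbc:postgresql://localhost:5432/mydb?reWriteBatchedInserts=true
```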

I recommend enabling Hibernate query logging so that you can check to see what is being generated. Keep in mind that the query re-writing is taking place in the driver which is "below" Hibernate. So you might see Hibernate send several "INSERT ..." statements, but the driver might still rewrite them.

You can also enable PgJDBC debug logging by setting the driver URL property logLevel=2 (or maybe it's logLevel=debug?).

UPDATE: Also, as noted above, having autoCommit=true (the HikariCP default) will completely disable batch rewriting for almost all drivers. So be sure to set it to false.

UPDATE2: I suspect that Hibernate requires that the generated IDs be returned from the inserts, because it uses the object IDs everywhere internally, so that is basically going to disable batching on PostgreSQL, AFAICT.

@cvaliere
Author

cvaliere commented Nov 4, 2016

@brettwooldridge thank you for all your explanations!

So, if I understand correctly, here are the 4 ways of making 100 INSERTs, ordered from least performant to most performant:

  1. Make 100 individual INSERTs, naively
  2. Make a "batch" of 100 INSERTs (why faster? because of the single round-trip to the database?)
  3. Make a "batch" of 1 prepared statement with 100 parameter sets, which will still execute 100 INSERTs, but faster (why faster? because the query plan is only built once?)
  4. Make a single multi-insert of 100 rows (why faster? because it's only 1 statement, and each statement has some overhead?)

Am I right in saying that Hibernate does #3 (when hibernate.jdbc.batch_size is defined)?
Am I right in saying that the new reWriteBatchedInserts in the PostgreSQL driver transforms #3 into #4?

I suspect that Hibernate requires that the generated IDs be returned from the inserts, because it uses the object IDs everywhere internally, so that is basically going to disable batching on PostgreSQL,

I don't think so. Hibernate gets the ID from the sequence before the INSERT, and sends the ID inside the INSERT, so it doesn't need to retrieve the ID from the INSERT because it already has it.

I recommend enabling Hibernate query logging so that you can check to see what is being generated. Keep in mind that the query re-writing is taking place in the driver which is "below" Hibernate. So you might see Hibernate send several "INSERT ..." statements, but the driver might still rewrite them.

yes, that's why, in my tests, I don't look at the logs; instead, I wrote a statement-level AFTER trigger on my table and count how many times it gets executed

to insert 100 objects:

  • With the Hibernate configuration alone, no matter what I do, the trigger is executed 100 times
  • With reWriteBatchedInserts, it is executed only 3 times, so the rewrite works! But why 3 times and not 1? I don't know, but I noticed that 3 is the number of 1-bits in the binary representation of the number of INSERTs: 100 is "1100100" in binary, so it generates 3 statements, whereas 128 would generate only 1. There must be a reason behind it, I just don't know which one.
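The 1-bit pattern is consistent with the driver rewriting a batch into sub-batches whose sizes are powers of two, so the number of rewritten statements equals the number of set bits in the batch size. A small sketch of that model (this only models the observed behaviour, it is not the driver's actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSplit {
    // Model of the observed behaviour: split a batch of n rows into
    // power-of-two chunks, largest first. Each chunk would correspond
    // to one rewritten multi-value INSERT statement.
    static List<Integer> chunks(int n) {
        List<Integer> out = new ArrayList<>();
        for (int bit = Integer.highestOneBit(n); bit > 0; bit >>= 1) {
            if ((n & bit) != 0) {
                out.add(bit);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(chunks(100)); // [64, 32, 4] -> 3 statements, matching the trigger count
        System.out.println(chunks(128)); // [128] -> a single statement
    }
}
```

Under this model, batch sizes that are exact powers of two (128, 256, ...) would fire the trigger exactly once.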

@brettwooldridge
Owner

@cvaliere You've got it basically right. In a well-optimized JDBC driver, such as Oracle's, or back in the day jTDS for SQL Server, a prepared statement with batching is the most performant way to do mass inserts through the JDBC interface. Of course, database-specific bulk-loading methods are more performant still, but they are not abstracted by JDBC.

A well optimized driver will:

  • Prepare the query. One network roundtrip to the DB.
  • Stage (buffer) data added through the addBatch() API in the driver, to minimize roundtrips.
  • Send tuples of pure data to the database when the staging buffer crosses some threshold (often internal) to avoid consuming too much client-side memory.
  • Flush the remaining tuples at executeBatch() time, and send the exec command to execute the query plan.

Less optimized drivers will, as the PgJDBC driver is doing by default, simply implement the batching API as a series of individual inserts -- basically defeating the purpose of batching.

Muddying the waters, some of these less optimized drivers include an optimization that recognizes multiple sequential identical inserts (same table, same columns) and transforms them into a single insert with multiple tuples. This is the "rewrite" that is supported by MySQL and PostgreSQL.

Combine the less optimized case, where the driver transforms a batch insert into individual inserts, with the optimization that recombines them into a single insert with multiple tuples, and you get somewhat in-between performance.

The thing is that there is a length limit to a SQL statement, different for every database. So, for some queries the driver might be able to combine 100 inserts into a single insert with 100 tuples, but other queries it can only combine 20 inserts into a single insert with 20 tuples -- because of the size of the data.

So, to answer the questions in your 1-4 ranking.

  1. Yes, the naive approach.
  2. Yes, faster because of fewer roundtrips.
  3. Yes, faster because of fewer roundtrips plus a single prepare/query compile. Unless the data is "too big" and the driver breaks the 100 inserts up into N batches rather than 1. Then you likely prepare the statement N times -- better than 100, worse than 1.
  4. Yes, fastest (if supported natively by the database). A single query plan, tuples streamed to the database (based on some buffer threshold) in N number of roundtrips until execution.

It might be interesting to try pgjdbc-ng, but I would strongly recommend building it from master rather than using 0.6 from a year ago.

@fresh-fx59

fresh-fx59 commented Jan 19, 2023

Maybe this will be useful for someone.

Spring Boot, Hibernate, PostgreSQL.

With the settings below, inserts are performed like

insert into table (id, name) values ($1, $2), ($3, $4) ...

The ?reWriteBatchedInserts=true URL parameter is what actually makes the inserts arrive batched on the PostgreSQL side.

application.yml

spring:
  jpa:
    properties:
      hibernate:
        generate_statistics: false
        order_inserts: true # for batch insert
        order_updates: true # for batch update
        jdbc:
          batch_size: 30 # for batch insert recommended value is between 1 and 30
          batch_versioned_data: true # for batch update
        id:
          new_generator_mappings: true
        ddl-auto: none
  datasource:
    type: com.zaxxer.hikari.HikariDataSource
    url: jdbc:postgresql://localhost:5432/db?reWriteBatchedInserts=true # batch inserts on db side

entity

@Entity
@Table(name = "test_batch_insert")
public class TestBatchInsert implements Serializable {

    private static final long serialVersionUID = 1L;

    @Id
    //https://stackoverflow.com/questions/12745751/hibernate-sequencegenerator-and-allocationsize
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "TestBatchInsert_SEQ")
    @SequenceGenerator(name = "TestBatchInsert_SEQ", sequenceName = "SEQUENCE_TestBatchInsert", allocationSize = 50)
    @Column(name = "id")
    private Long id;

    @Column(name = "name", unique = true)
    private String name;

}

Useful links:

  • Useful post about reWriteBatchedInserts by Vlad Mihalcea
  • One more post from him about INSERT UPDATE in Hibernate
  • Boost JPA insert performance
  • Batch Insert/Update with Hibernate/JPA
  • Hibernate not batching inserts
