
15x - 150x faster truncation strategy for MySQL (tested with the mysql2 gem) #126

Closed
stanislaw opened this Issue July 02, 2012 · 19 comments

Stanislaw Pankevich

Hello,

Somehow, I didn't use DatabaseCleaner at all for a very long time; instead I just used tip #3 from Jose Valim's post http://blog.plataformatec.com.br/2011/12/three-tips-to-improve-the-performance-of-your-test-suite/.

Today I tried to switch to Capybara's :webkit driver instead of the default :selenium for my JS tests. This led to some issues with transactions, which had always worked well for me with transactional fixtures and Jose's tip for a shared AR connection. That in turn led me to try the truncation strategy, and finally to using truncation via DatabaseCleaner.

Running DatabaseCleaner.clean on teardown, I saw that it truncates all of my ~30 tables, even if the test touched only one table and filled it with a couple of records. That is how I discovered what the drama about avoiding truncation and preferring transactions REALLY means: it is damn slow.

I suggest a better approach:

On my first iteration, my idea was to check whether each table is empty before running TRUNCATE on it. This worked: SELECT COUNT(*) is SO MUCH FASTER than TRUNCATE that the check is cheap: if count > 0, then truncate. I was about to open an issue about it here when I also remembered resetting the AUTO_INCREMENT counter and discovered this issue: #84.

I personally don't care whether the next ids are reset before each test, because I've never written tests that rely on that kind of consistency. That case deserves its own truncation strategy, something like :fast_truncation; but of course a gem as general as 'database_cleaner' SHOULD keep a :truncation strategy that resets ids as the default.

Second iteration: we check SELECT COUNT(*) and then fetch the AUTO_INCREMENT value of the table we operate on, which is done by:

SELECT Auto_increment 
FROM information_schema.tables 
WHERE table_name='#{table}'

And then we execute TRUNCATE only if AUTO_INCREMENT is > 1.
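In Ruby against an ActiveRecord mysql2 connection, the second iteration looks roughly like this (a sketch only - the connection calls and helper name are illustrative, not database_cleaner's actual API):

# Truncate a table only if it has ever been written to.
def truncate_if_dirty(connection, table)
  row_count = connection.select_value("SELECT COUNT(*) FROM #{table}").to_i
  auto_inc  = connection.select_value(
    "SELECT AUTO_INCREMENT FROM information_schema.tables WHERE table_name = '#{table}'"
  ).to_i

  # An untouched table has zero rows and AUTO_INCREMENT = 1, so its TRUNCATE is skipped.
  connection.execute("TRUNCATE TABLE #{table};") if row_count > 0 || auto_inc > 1
end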

I set up a test repository demonstrating all of this: https://github.com/stanislaw/truncate-vs-count
See the commented-out line in the script. As it stands in the repo, the script runs the whole truncation procedure 15x faster than DatabaseCleaner with :truncation does. If we uncomment that line, it becomes 150x faster, at the cost of sacrificing the AUTO_INCREMENT reset.

I feel I am taking a bit of a risk writing this issue, and I wonder how such a serious gem as database_cleaner could lack such a simple solution based on core MySQL features, one that gives its strategy amazing performance.

Thanks.

Stanislaw Pankevich

@jnicklas, I spent hours on Capybara's README, particularly its "Transactions and database setup" section.

'Transactions and truncation' was not the easiest topic to learn, and it is not an easy one to set up again and again across different environments.

Please take a look at this issue. I ask you as the man to whom I owe half of my knowledge about TDD.

I hope I didn't overlook something important.

Thanks!

Jonas Nicklas

Hey @stanislaw, I don't really know much about the internals of database_cleaner. Let's hang in there and see what @bmabey says. Personally, I would be okay with sacrificing the index reset, or have that as a config option at least. Then again, I never use MySQL these days and I don't know how any of this applies to postgres.

Stanislaw Pankevich

I am sure this applies to Postgres in the same way. I haven't used it for some time, but if @bmabey suggests that I write a solid PR for this, I will run the same test for Postgres as well.

The concept is so simple that I hesitate: is it really as good as it seems?

Two selects are definitely faster than one truncate, I think, whatever adapter you use.

OK, we'll see.

Thanks for responding so quickly.

Ben Mabey
Owner
bmabey commented July 03, 2012

Hey @stanislaw, thanks for the work and the interesting benchmark! I'm all for having faster truncation. The original code was modeled on (i.e. copied and pasted from) Rails' truncation code and no attempt was made to make it faster, so I'm glad to see you doing this. :)

It seems like your method of checking 'Auto_increment' would be a good sane default, since it would preserve the current behavior and speed up the common use case. To do that we would need to modify these lines for MySQL. Ideally, we could support a similar operation for all the DBs where it makes sense, like you have begun doing with Postgres. If writing all this behavior for every database adapter combination and its DataMapper counterpart is more work than you want to do, then just adding it for your needs and documenting the differences in the README would be acceptable as well.

WRT the faster option of ignoring the auto_increment constraint... I like having that as an option as well. I think it would make the most sense for it to be an option on the truncation strategy instead of a separate strategy altogether. Perhaps something like:

DatabaseCleaner.strategy = :truncation, {:only => %w[widgets dogs some_other_table], :reset_ids => false}

(Where :reset_ids would default to true.)

To support this we would save the :reset_ids option here and then dispatch appropriately on that option here. For example, each adapter class would implement a #truncate_table_no_id_reset method in addition to #truncate_table, and the option would dictate which one is called. Since this option won't change, we could memoize the dispatch so the conditional doesn't have to run on each #clean invocation. I would probably opt to do this caching in the initializer.
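A rough sketch of that dispatch, with hypothetical names rather than the gem's actual internals:

class Truncation
  def initialize(opts = {})
    @reset_ids = opts.fetch(:reset_ids, true)
    # Memoize the dispatch in the initializer so #clean doesn't re-check the option each run.
    @truncate_method = @reset_ids ? :truncate_table : :truncate_table_no_id_reset
  end

  def clean
    connection = ::ActiveRecord::Base.connection
    tables_to_truncate.each do |table|
      connection.send(@truncate_method, table)
    end
  end
end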

Another surprise from your benchmark was that DatabaseCleaner was slower than plain vanilla truncation. I think this may be because DatabaseCleaner is creating too many connections to the DB. I have fixed that in this branch, but I haven't been able to get that branch's functionality to work with the current API so I can safely deprecate the old API. I'll have to work on that some more. Anyway, this shouldn't affect the changes you want to make to the truncation strategy.

Ben Mabey
Owner
bmabey commented July 03, 2012

As an aside, I think it would also be cool to support the hack/trick that Jose Valim points out in that blog post.
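For reference, the trick from that post forces every thread onto a single shared ActiveRecord connection, roughly:

class ActiveRecord::Base
  mattr_accessor :shared_connection
  @@shared_connection = nil

  def self.connection
    @@shared_connection || retrieve_connection
  end
end

# The test process and the app server thread now share one connection, so data
# created inside the test transaction is visible to browser-driven requests.
ActiveRecord::Base.shared_connection = ActiveRecord::Base.connection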

stanislaw referenced this issue from a commit in stanislaw/database_cleaner on July 04, 2012:
Fast truncation implemented for ActiveRecord with mysql or mysql2 (addresses #126) (dbc81e7)
Stanislaw Pankevich

Notes on this commit:

bundle
bundle exec rspec spec/database_cleaner/active_record
  1. I preserved the behaviour of the default #truncate_table in the MySQL and MySQL2 adapters.
  2. reset_ids is set to true by default. I think running faster by sacrificing the resets should be opt-in.
  3. IMPORTANT! The code shared by the two adapters looks duplicated, but it actually isn't, because MySQL uses an older fetching mechanism (see the differences). I left the possible extraction of this code into a module to you.
  4. The entire ActiveRecord suite now runs on RSpec 2. I strongly recommend moving your tests to RSpec 2 - it is very easy. I didn't run the other adapters, but I don't think it should be a problem.
  5. Besides adding #fast_truncate_tables, I've also implemented actual execution of the queries, to test that the truncations really run; you had those specs marked pending. I think the design I've set up makes it easy to accumulate different adapter configs in the spec/support/active_record/*.rb folder (like I did with mysql and mysql2).

Waiting for your review.

Stanislaw Pankevich

Notes on PostgreSQL and other adapters. I did a quick exploration of whether PostgreSQL's performance could be improved the same way. Here is what I've discovered:

  1. We can't rely on SELECT COUNT(*) in PG, because it is known to be slow: http://wiki.postgresql.org/wiki/Slow_Counting.
  2. There is no exact AUTO_INCREMENT equivalent there. PG has its analogue, but I couldn't figure out how to make it work for me.

I am still sure it can be done (intuitively it seems cheaper to me to do one fast if-check query than one truncation). I even asked on SO: http://stackoverflow.com/questions/11311079/postgresql-db-30-tables-with-number-of-rows-100-the-fastest-way-to-clean-eac. The answers said it is not possible, other than just truncating, but I think it can be done anyway, because none of the answers really proved that. I've also written a similar letter to the PG developers' 'performance' mailing list and am now waiting for their answer.

This leads me to the following conclusion:

To implement the same coolness in other adapters like PG, we need to consult with specialists in these particular ORMs and DBs.

I can promise my participation in PG testing. The others, especially the non-relational DBs, are completely unfamiliar to me. Maybe you know some PG experts to summon here?

Stanislaw Pankevich

OK, I found a solution for Postgres as well, achieving the same results. It is at https://github.com/stanislaw/truncate-vs-count again.

I slightly improved the design, so you can play with the number of tables and records.

I think it should work pretty well on all versions of PG.

I will add #fast_truncate_tables to the PG adapter soon as well.

UPDATE: No! The solution I found is not enough.
UPDATE 2: Solution found, see https://github.com/stanislaw/truncate-vs-count

Stanislaw Pankevich

Besides preparing the pull request for both the MySQL and PostgreSQL strategies, I want to ensure that the queries I run are valid against various versions of MySQL and PG.

I finished the PG part as well, making its queries as universal as possible.

Latest results are on https://github.com/stanislaw/truncate-vs-count.

Ben Mabey
Owner
bmabey commented July 06, 2012

Thanks for the patches @stanislaw, I'm reviewing them now. I am unable to run the tests with the RSpec 2 upgrade via rake. I changed the rake tasks to:

require 'rspec/core/rake_task'
RSpec::Core::RakeTask.new(:spec)
RSpec::Core::RakeTask.new(:rcov) do |t|
  t.rcov = true
end

After that change I still get the error: Shared example group 'a generic strategy' already exists.

I'll look over the other changes now...

Stanislaw Pankevich

A couple of notes on: "As an aside, I think it would also be cool to support the hack/trick that Jose Valim points out in that blog post."

1) If you take a look at what Cucumber::Rails::Database.javascript_strategy = :shared_connection does, you will see that this strategy already does this kind of trick out of the box in Cucumber.

2) I use Jose's hack only for RSpec testing, where I don't do any integration tests using Capybara, so that entire suite is based on the transaction approach working on a single AR connection.
This hack also worked well for me in Cucumber for a long time, before I switched Capybara.javascript_driver from :selenium to :webkit and/or :poltergeist. Having a shared connection when using :webkit caused too many problems. By the way, that is the reason I began my effort to optimize DatabaseCleaner's :truncation strategy. The loss in performance from doing fast :truncation instead of :transaction with :shared_connection is almost unnoticeable, and this approach is MUCH more stable.

I hope this makes sense.

Ben Mabey
Owner
bmabey commented July 10, 2012

Ah, I didn't know about Cucumber::Rails::Database.javascript_strategy = :shared_connection. Good to know, thanks!

Nice to see that the fast truncation performs just as well as transactions with a shared connection.

Stanislaw Pankevich

Besides digging deeper into PG internals :), I made one more epic attempt to make the fast MySQL truncation strategy even faster. But I failed - the code in the PR is the fastest.

Just to let you know about the attempt and its results:

def fast_truncate_tables(_tables)
  return unless _tables.any?

  res = execute %{
    SELECT DISTINCT TABLE_NAME
    FROM INFORMATION_SCHEMA.TABLES
    WHERE TABLE_NAME IN (#{ _tables.map { |t| "'#{t}'" }.join(',') }) AND AUTO_INCREMENT > 1;
  }

  res.each { |row| truncate_table row.first }
end

I.e., I tried to operate directly on the MySQL schema. Such a query takes 25-30ms. That is MUCH slower than SELECT EXISTS(SELECT 1 FROM table LIMIT 1), which takes 0.2ms - for 40 tables that is no more than 8ms!

For my Rails app's test suite, this code gave me 6m34.290s, while the code that is in the PR still came in at 5m9s.

Having said that, I confirm once more that within the truncation strategy itself, when using MySQL, my code seems to be the fastest of the 5 different approaches I tried over these two weeks.

I would now bet that the PR code for MySQL is the fastest possible.
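For contrast, the per-table pre-check that won has roughly this shape (a simplified sketch that leaves out the AUTO_INCREMENT handling; the method name is illustrative):

def table_dirty?(connection, table)
  # The ~0.2ms check: returns 1 if the table contains at least one row.
  connection.select_value("SELECT EXISTS (SELECT 1 FROM #{table} LIMIT 1)").to_i == 1
end

tables.each do |table|
  connection.execute("TRUNCATE TABLE #{table};") if table_dirty?(connection, table)
end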

Craig Ringer

It looks like there may be truncation speed improvements coming to newer PostgreSQL versions - at least for the fsync=off case that's common in non-data-durable testing. I doubt the changes will land in 9.2, but I'm not sure. Discussion thread roots here: http://archives.postgresql.org/message-id/CAMkU=1yLXvODRZZ_=fgrEeJfk2tvZPTTD-8n8BwrAhNz_WBT0A@mail.gmail.com and here: http://archives.postgresql.org/message-id/50037464.7020503@ringerc.id.au

Ben Mabey
Owner
bmabey commented July 16, 2012

Thanks @ringerc for the pointers and amazingly thorough SO answer.

Stanislaw Pankevich

I am happily closing this!

Don't forget to close these two year-old issues as well: #62 and #69.

stanislaw closed this on August 06, 2012
Nathan Wenneker

For anyone stumbling on this issue directly, note that these ideas were implemented in #127 and are available now via the :pre_count and :reset_ids options to the truncation strategy. Perhaps I'm not the only one who wondered if/how this was resolved, since it was closed without a reference. Thanks! Great work on this, BTW - it really helps speed up my suite.
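For example, with a recent version of the gem the options from #127 can be enabled like this (exact availability depends on the adapter and gem version):

DatabaseCleaner.strategy = :truncation, {:pre_count => true, :reset_ids => false}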

Stanislaw Pankevich

Yeah, sorry! Thanks for noticing that.
