
More efficient comparison of large resultsets #10

Closed
2 tasks done
benilovj opened this issue Feb 11, 2013 · 37 comments · Fixed by #264

@benilovj
Member

  • make CompareStoredQueries more efficient than O(n²): solved for the case of comparing sorted sets
  • add a way to hide successfully matched rows (so that only failures appear in the test results and the results aren't too bloated)
@marcusrehm
Contributor

Hi Jake,

Is there a plan to develop this feature in the near future? We are starting to use DbFit to test our DW/BI environment, and manipulating large datasets is very important.

You guys have done a great job with this tool! Congrats!

Best Regards,
Marcus

@benilovj
Member Author

benilovj commented Jan 7, 2014

Hi Marcus,

Unfortunately, probably not in the near future. I'd of course accept pull requests and would be happy to give pointers about how this could be implemented...

Regards,
Jake

@marcusrehm
Contributor

OK Jake! So I'm going to fork the project and try to get the test-environment VM up. If I have any questions about that, where can I ask for help?

You said that you could give pointers on how this could be implemented. I have already had a look at the CompareStoredQueries class and saw that the algorithm is O(n²); is that right?

Regards,
Marcus

@benilovj
Member Author

benilovj commented Jan 8, 2014

If I have any questions about that, where can I ask for help?

Probably the forum is the best place - that way, either I or @javornikolov can jump in and help, and it's there for the benefit of others too.

Alternatively, if you find bugs in the setup/documentation, a GitHub issue/pull request is probably the most appropriate.

I have already had a look at the CompareStoredQueries class and saw that the algorithm is O(n²); is that right?

Sadly, yes, that's the fundamental problem. Another issue is not having a way to suppress successfully matched rows. I'll have a look at the code and add to this issue if I can think of more things that might be needed.
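To illustrate the fundamental problem being discussed, here is a minimal sketch (not DbFit's actual code; the row representation and method names are invented for illustration). The naive matcher scans all actual rows for every expected row, which is O(n·m); building a hash index over the rows first makes each lookup constant-time on average.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified illustration of the O(n^2) comparison vs. an indexed one.
// Rows are flattened to strings here; DbFit's real row classes differ.
public class RowMatching {

    // Naive matching: nested loops, O(n * m) comparisons in the worst case.
    public static int countMatchesNaive(List<String> expected, List<String> actual) {
        int matches = 0;
        for (String e : expected) {
            for (String a : actual) {
                if (e.equals(a)) { matches++; break; }
            }
        }
        return matches;
    }

    // Indexed matching: build a hash index once, then O(n + m) on average.
    public static int countMatchesIndexed(List<String> expected, List<String> actual) {
        Map<String, Integer> index = new HashMap<>();
        for (String a : actual) {
            index.merge(a, 1, Integer::sum); // count occurrences of each row
        }
        int matches = 0;
        for (String e : expected) {
            Integer remaining = index.get(e);
            if (remaining != null && remaining > 0) {
                index.put(e, remaining - 1); // consume one matching row
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> expected = List.of("1|alice", "2|bob", "3|carol");
        List<String> actual = List.of("2|bob", "1|alice", "4|dave");
        System.out.println(countMatchesNaive(expected, actual));   // 2
        System.out.println(countMatchesIndexed(expected, actual)); // 2
    }
}
```

Both methods report the same matches; only the cost differs as the resultsets grow.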

@marcusrehm
Contributor

I created a pull request for "suppress successfully matched rows".

@marcusrehm
Contributor

add a way to hide successfully matched rows (so that only failures appear in the test results and the results aren't too bloated)

@javornikolov / @benilovj
Now that we are completing the item above, do you guys think we can start working on the algorithm improvements?

@javornikolov
Contributor

@marcusrehm, I think the hiding of matched rows is good enough for now.

Some thoughts of things which may have some impact here:

  • I started an attempt to decouple the matching logic from FitNesse in PR Decouple results matching logic of CompareStoredQueries from FitNesse #213. Ideally this should make things easier to test and also allow switching between different algorithms. It's still quite messy and will undergo some cleanup, but you may take a look too and see if it makes sense and what would be beneficial to make it easier to implement new matching algorithms.
  • One thing which I'm a bit uncomfortable with in the current implementation: it relies on having key fields (i.e., no duplicates in them). I haven't faced a scenario where I couldn't work around that limitation, but it may be worth keeping in mind that some day we may try to remove this restriction.
  • We can think of ideas for more performant comparison of larger resultsets. (I also wonder how big the sizes we want to address are.)

@javornikolov
Contributor

One idea here which would be easy to try: the current algorithm plus an order by clause in the stored resultsets.

@marcusrehm
Contributor

Yes, I'm already using order by on the queries to speed up comparisons. I will look for a case where I can test it with a large number of rows to check the difference in timings.

I started an attempt to decouple the matching logic from FitNesse in PR #213. Ideally this should make things easier to test and also allow switching between different algorithms. It's still quite messy and will undergo some cleanup, but you may take a look too and see if it makes sense and what would be beneficial to make it easier to implement new matching algorithms.

What I can think of right now is that it would be good if we could have a partial match of resultsets, where we could set the test to pass (green) while ignoring missing rows from query 2, for example. This would be nice for cases like ETL loads from Stage to ODS or DW. What do you think about it?

We can think of ideas of more performant comparison of larger resultsets. (I also wonder how big sizes we want to address).

About resultset size, I was thinking of DW processes, so a good starting point would be around 100,000 rows.

@marcusrehm
Contributor

Hi @benilovj / @javornikolov ,

As posted on Decouple results matching logic of CompareStoredQueries from FitNesse, I've been working on the algorithms, and for this I created a DataRowIndexer class; it's in my compare-stored-queries-as-matcher branch (marcusrehm@176c958).

  • Actual:
    • Assertions: 27 right, 7 wrong, 0 ignored, 0 exceptions (4.774 seconds)
    • Assertions: 3938 right, 185 wrong, 0 ignored, 0 exceptions (122.588 seconds)
    • Assertions: 5943 right, 277 wrong, 0 ignored, 0 exceptions (187.049 seconds)
    • Assertions: 2232922 right, 0 wrong, 0 ignored, 0 exceptions (161.574 seconds)
  • HashMap:
    • Assertions: 27 right, 7 wrong, 0 ignored, 0 exceptions (5.273 seconds)
    • Assertions: 3938 right, 185 wrong, 0 ignored, 0 exceptions (110.233 seconds)
    • Assertions: 5943 right, 277 wrong, 0 ignored, 0 exceptions (163.415 seconds)
    • Assertions: 2232922 right, 0 wrong, 0 ignored, 0 exceptions (159.660 seconds)
  • TreeMap:
    • Assertions: 27 right, 7 wrong, 0 ignored, 0 exceptions (5.055 seconds)
    • Assertions: 3938 right, 185 wrong, 0 ignored, 0 exceptions (111.090 seconds)
    • Assertions: 5943 right, 277 wrong, 0 ignored, 0 exceptions (163.445 seconds)

I think we can stay with this for now and work on the remodeling in #213. For now I think a Factory pattern to load the Indexer would be interesting, so we can instantiate it in CompareStoredQueries.

Another thing @javornikolov asked for was to run these tests with the order by clause on the backend, so I will try to do that today or tomorrow and put the results here.

@javornikolov
Contributor

A slightly more elaborate experiment for the order by case: we can try one more algorithm which actually relies on having sorted inputs (this will allow some optimizations on top of the current implementation).

Basically, this is a sort-merge join: http://sybaseblog.com/2011/01/28/joins-algorithms/ which is O(N+M) plus the time for sorting. The trick here is that the backend may be able to sort quite fast.
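The sort-merge idea can be sketched as follows (hypothetical names, not DbFit's API): with both inputs already sorted by key, a single pass with two cursors classifies every row as matched, missing, or surplus in O(N+M) comparisons.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a sort-merge comparison of two key-sorted resultsets.
public class SortMergeCompare {

    public enum Outcome { MATCH, MISSING, SURPLUS }

    // Walk both sorted key lists with two cursors, classifying each key.
    public static List<Outcome> compare(List<String> expectedSorted, List<String> actualSorted) {
        List<Outcome> outcomes = new ArrayList<>();
        int i = 0, j = 0;
        while (i < expectedSorted.size() && j < actualSorted.size()) {
            int cmp = expectedSorted.get(i).compareTo(actualSorted.get(j));
            if (cmp == 0) { outcomes.add(Outcome.MATCH); i++; j++; }
            else if (cmp < 0) { outcomes.add(Outcome.MISSING); i++; } // expected row absent from actual
            else { outcomes.add(Outcome.SURPLUS); j++; }              // actual row not expected
        }
        while (i++ < expectedSorted.size()) outcomes.add(Outcome.MISSING); // leftovers on either side
        while (j++ < actualSorted.size()) outcomes.add(Outcome.SURPLUS);
        return outcomes;
    }

    public static void main(String[] args) {
        System.out.println(compare(List.of("a", "b", "d"), List.of("b", "c", "d")));
    }
}
```

Note the correctness of this pass depends entirely on both inputs really being sorted, which is the restriction discussed later in the thread.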

@marcusrehm
Contributor

Hi @javornikolov / @benilovj ,

I'm developing the matching-algorithm abstraction and I have a question about the Options class: I can't see how to retrieve an option value after a call like Options.setOption("DefaultRowIndexer", "HashMap");. I can't find an Options.getOption("DefaultRowIndexer"); method. Am I missing something?

About the tests I was doing: once this is finished it will be much easier to run them, because I will be able to choose the algorithm with |set defaultrowindexer|hashmap|, for example.

@javornikolov
Contributor

Yes, currently Options always resolves to boolean. I'm not sure yet where the best place to configure this would be. Anyway, Options seems a possible place to start with; it will need a more generic set/get or a dedicated setDefaultMergingAlgorithm(value).
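A generic string-valued set/get along the lines discussed could look roughly like this. This is a hypothetical sketch, not DbFit's real Options class (which at this point only resolved options to booleans); all names below are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a generic Options store with string values.
public class OptionsSketch {
    private static final Map<String, String> options = new HashMap<>();

    public static void setOption(String name, String value) {
        options.put(name.toLowerCase(), value); // case-insensitive option names
    }

    // The generic getter the thread notes is missing from the real class.
    public static String getOption(String name, String defaultValue) {
        return options.getOrDefault(name.toLowerCase(), defaultValue);
    }

    // Convenience for boolean options, matching the current boolean-only use.
    public static boolean is(String name) {
        return Boolean.parseBoolean(getOption(name, "false"));
    }

    public static void main(String[] args) {
        setOption("DefaultRowIndexer", "HashMap");
        System.out.println(getOption("defaultrowindexer", "unset")); // prints "HashMap"
    }
}
```

With something like this, a fixture row such as |set defaultrowindexer|hashmap| could be resolved to an indexer implementation at comparison time.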

@benilovj
Member Author

benilovj commented Feb 3, 2014

Question: do we really need the merging algorithm to be configurable? Why would we want to have more than one?

@javornikolov
Contributor

The way I see it, the initial idea is to experiment with and compare the algorithms we have. For @marcusrehm, it seems such a switch may be helpful for some of these experiments.

I'm not sure yet about exposing such an option to end users. But in general, the best algorithm usually depends on the amount of data on both sides, the number of mismatches, and the distribution and ordering of the data.

@marcusrehm
Contributor

@benilovj, for now it would help a lot in testing different algorithm implementations, and, as @javornikolov said, in some cases end users can benefit from the option to change the algorithm. I had a case, for example, where an acceptance test takes ~1:30 minutes to finish; now imagine if we can tune the tests by choosing the best algorithm, it could save some time across a whole set of tests.

@benilovj
Member Author

benilovj commented Feb 3, 2014

some cases end users can benefit from the option to change the algorithm

What are those cases? I don't see them at the moment; I just see the overhead of maintaining more complex code and more complex documentation.

now imagine if we can tune the tests by choosing the best algorithm

As I've said before, I believe that prod tests in DbFit are a bad idea. I don't want to encourage users to tinker with comparison algorithms; that time is better spent improving the test itself.

In any case, a switch won't be part of the first implementation of the feature; let's start with one algorithm before considering moving on to multiple.

@javornikolov
Contributor

I think the typical case with DbFit should be tests on top of very small amounts of data (though the number of tests themselves may be large). For such scenarios, from the measurements which @marcusrehm has performed, the existing implementation works a bit faster than the new ones using HashMap and TreeMap.

My initial bias is towards not harming the typical case in favour of cases with large data sets.

One idea is what I suggested as an additional approach: something like Compare Ordered Queries. This should be able to work efficiently for small sets too, since it doesn't add the overhead of maintaining indexes or hash tables.

@marcusrehm
Contributor

what are those cases? I don't see them at the moment, I just see the overhead of maintaining more complex code and more complex documentation

@benilovj, I see your point. That is not what DbFit was created for, and I understand it perfectly. Another point that I agree with you on is:

as I've said before, I believe that prod tests in DbFit are a bad idea, I don't want to encourage users to tinker with comparison algorithms, that time is better spent improving the test itself

I am not talking about using DbFit in production, nor for performance tests. I totally agree with you on this, but I think the issue about the amount of data is related to the nature of what we are testing. If we are testing transactional applications, maybe two to five thousand rows could be a good upper limit, but for data warehouse applications that amount of rows may not reflect a representative subset of the whole range of cases the application has to deal with.

We can try it this way: I will make the change in the Options class just to run the tests in my environment, and once we decide which algorithm should stay, I can revert this modification and commit the main changes.

What do you guys think?

@javornikolov
Contributor

We can try it this way: I will make the change in the Options class just to run the tests in my environment

This was my initial understanding. Once we have more data collected from the experiments, we can judge better whether a single algorithm is good enough for all the scenarios we want to cover.

@javornikolov
Contributor

If we are testing transactional applications, maybe two to five thousand rows could be a good upper limit, but for data warehouse applications that amount of rows may not reflect a representative subset of the whole range of cases the application has to deal with.

I don't think this is specific to DW systems. What would prevent you from splitting that into multiple smaller tests working on different smaller subsets? The same may prevent you in the OLTP world too.
I can imagine a situation where a larger set of data may be desired:

  • Regression testing when lacking a (good) understanding of the existing (old) behaviour. So we store a large enough sample which is considered representative and compare against it. This can be easy to set up, but of course it has drawbacks:
    • we postpone dealing with the real problem: understanding/describing the behaviour of the system so that we can split it into smaller tests
    • there is a risk that, even though large, the sample may not really be representative and may still not cover all the interesting cases

@benilovj
Member Author

benilovj commented Feb 3, 2014

@javornikolov beautifully put, my thoughts exactly.

@marcusrehm
Contributor

Guys, I think we are defending the same point of view; maybe, as my English is not so good, I didn't express myself well.

Once we have more data collected from experiments: we could judge better if a single algorithm is good enough for all the scenarios which we want to cover or not.

Yes, and that is true, nothing changes.

I can imagine a situation when larger set of data may be desired:
Regression testing when lacking (good) understanding of the existing (old) behaviour.

we postpone dealing with the real problem: understanding/describing the behaviour of the system so that we can split into smaller tests

This is exactly the situation I have here. Our job with DbFit consists of two tasks:

  • Validate the ETL that brings data from a legacy system (with almost no documentation) into the data warehouse; for this we are doing unit tests with smaller subsets. That is OK.
  • The second is to validate the data coming from the legacy system and create the acceptance tests. As business users do not trust all the information stored in the legacy system (a considerable subset of it is outdated), they created a process where each subsidiary sends a monthly worksheet with its employee information (who was hired, who was fired, who was promoted, and so on), and these worksheets are consolidated in our central office. Together these worksheets cover approximately 10,000 employees. The only approach that seemed good was to match the employee data in this consolidated worksheet against the employee data from the legacy system. Now imagine a test that tries to match 10,000 x 10,000 records.

@javornikolov
Contributor

@marcusrehm, now that we're done with #213, I wonder how it compares performance-wise to the state before the redesign (with the default and with the alternative indexing strategies). Would you be able to run some tests to measure that? Then we can go on to choose the best default algorithm and decide whether it makes sense to have the ability to switch it.

@marcusrehm
Contributor

Hi @javornikolov / @benilovj ,

I ran the tests again with the new implementation, and what I can conclude is that the actual algorithm gets better results when ordered resultsets are used, but with unordered resultsets the HashMap gets better performance.

Maybe if we could ensure that the rows are sorted at some point before the comparison, we could guarantee the observed behaviour. Also, using an ordered list we could improve the row search by stopping it right after the algorithm reaches the n+1 key group.

Any thoughts?

  • Actual
    • Not Ordered:
      • Assertions: 276 right, 0 wrong, 0 ignored, 0 exceptions (1.919 seconds)
      • Assertions: 5904 right, 290 wrong, 0 ignored, 0 exceptions (189.857 seconds)
    • Ordered:
      • Assertions: 5904 right, 290 wrong, 0 ignored, 0 exceptions (168.001 seconds)
      • Assertions: 2237620 right, 0 wrong, 0 ignored, 0 exceptions (156.769 seconds)
  • HashMap
    • Not Ordered:
      • Assertions: 276 right, 0 wrong, 0 ignored, 0 exceptions (1.935 seconds)
      • Assertions: 5904 right, 580 wrong, 0 ignored, 0 exceptions (167.081 seconds)
    • Ordered:
      • Assertions: 5904 right, 580 wrong, 0 ignored, 0 exceptions (169.171 seconds)
      • Assertions: 2238470 right, 0 wrong, 0 ignored, 0 exceptions (170.528 seconds)
  • TreeMap
    • Not Ordered:
      • Assertions: 276 right, 0 wrong, 0 ignored, 0 exceptions (1.997 seconds)
      • Assertions: 5904 right, 580 wrong, 0 ignored, 0 exceptions (169.409 seconds)
    • Ordered:
      • Assertions: 5904 right, 580 wrong, 0 ignored, 0 exceptions (169.389 seconds)
      • Assertions: 2238470 right, 0 wrong, 0 ignored, 0 exceptions (163.913 seconds)
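The early-stop idea mentioned above (aborting a search once the scan passes the key group being looked for) can be sketched like this; the class and method names are hypothetical, not DbFit's actual code.

```java
import java.util.List;

// Sketch of the "stop at the n+1 key group" optimization: when scanning a
// key-ordered list, a missing key is detected as soon as we reach a key
// greater than the target, instead of scanning the whole list.
public class SortedSearch {

    // Returns the index of key, or -1 if absent; on a sorted list the scan
    // stops at the first key past the target.
    public static int findWithEarlyStop(List<String> sortedKeys, String key) {
        for (int i = 0; i < sortedKeys.size(); i++) {
            int cmp = sortedKeys.get(i).compareTo(key);
            if (cmp == 0) return i;
            if (cmp > 0) return -1; // passed where the key would be: it's absent
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("a", "c", "e");
        System.out.println(findWithEarlyStop(keys, "c")); // 1
        System.out.println(findWithEarlyStop(keys, "b")); // -1, after looking at only two keys
    }
}
```

As discussed below, this only speeds up the failure cases (missing rows); when every key is present, the scan still has to reach each match.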

@javornikolov
Contributor

@marcusrehm, thanks for the measurements. Why do we have different numbers of right/wrong in some of the cases?

Also, using an ordered list we could improve the row search by stopping it right after the algorithm reaches the n+1 key group.

I've also been thinking about similar optimizations. The thing here is that we would need to guarantee sorted resultsets, otherwise the outcome will be incorrect. That is possible, but I just wonder if it's worth the effort, since this will only improve the cases where we have failures.

In general I think it's more important to have the tests executing quickly when they're passing. The failures don't seem so critical to me in terms of performance; ideally they shouldn't stay red for too long. What do you think?

@javornikolov
Contributor

Some ideas if we end up with something which relies on having sorted incoming rows:

  • It could be some kind of option when declaring the test (options:sorted_inputs=yes)
  • It could be explicit sorting in our code. (However, this is very likely to be much slower than a database order by.)

@javornikolov
Contributor

And a somewhat different approach, which however may need several database-specific implementations: if we have two queries, we can generate a query which compares the results in the database.

@marcusrehm
Contributor

Why do we have different numbers of right/wrong in some of the cases?

@javornikolov, I ran the tests across two different days, so the numbers of records were different. I ran them again and the results are below:

  • Actual
    • Not Ordered:
      • Assertions: 276 right, 0 wrong, 0 ignored, 0 exceptions (1.919 seconds)
      • Assertions: 5904 right, 290 wrong, 0 ignored, 0 exceptions (189.857 seconds)
      • Assertions: 2238910 right, 0 wrong, 0 ignored, 0 exceptions (165.701 seconds)
    • Ordered:
      • Assertions: 276 right, 0 wrong, 0 ignored, 0 exceptions (1.529 seconds)
      • Assertions: 5904 right, 290 wrong, 0 ignored, 0 exceptions (168.001 seconds)
      • Assertions: 2238910 right, 0 wrong, 0 ignored, 0 exceptions (130.965 seconds)
  • HashMap
    • Not Ordered:
      • Assertions: 276 right, 0 wrong, 0 ignored, 0 exceptions (1.935 seconds)
      • Assertions: 5904 right, 580 wrong, 0 ignored, 0 exceptions (167.081 seconds)
      • Assertions: 2238910 right, 0 wrong, 0 ignored, 0 exceptions (150.022 seconds)
    • Ordered:
      • Assertions: 276 right, 0 wrong, 0 ignored, 0 exceptions (1.904 seconds)
      • Assertions: 5904 right, 580 wrong, 0 ignored, 0 exceptions (169.171 seconds)
      • Assertions: 2238910 right, 0 wrong, 0 ignored, 0 exceptions (139.807 seconds)
  • TreeMap
    • Not Ordered:
      • Assertions: 276 right, 0 wrong, 0 ignored, 0 exceptions (1.997 seconds)
      • Assertions: 5904 right, 580 wrong, 0 ignored, 0 exceptions (169.409 seconds)
      • Assertions: 2238910 right, 0 wrong, 0 ignored, 0 exceptions (158.982 seconds)
    • Ordered:
      • Assertions: 276 right, 0 wrong, 0 ignored, 0 exceptions (1.591 seconds)
      • Assertions: 5904 right, 580 wrong, 0 ignored, 0 exceptions (169.389 seconds)
      • Assertions: 2238910 right, 0 wrong, 0 ignored, 0 exceptions (175.339 seconds)

I agree with you; I need to focus on the passing tests, since the failing ones have a short lifetime. And about the options we have, I think that creating several database-specific implementations would be difficult to maintain and isn't worth it.

It could be some kind of option when declaring the test (options:sorted_inputs=yes)

This is the best approach for me. Actually, I saw that QueryFixture uses a similar approach for choosing the algorithm to find rows. This way we let the user do the ordering job in the database and choose when to use it to improve performance.
Maybe we can go this way.

@javornikolov
Contributor

I'm curious what the performance would be if we tried an optimized version relying on ordered inputs. One thing on my mind (looking at MatchableDataTable) is to get rid of the unprocessedRows list; it's enough to just keep a pointer/iterator to the current row (marking a row as "processed" would be moving to the next one).

@marcusrehm, what do you think? Would you be able to measure the performance of something like that? If it brings a significant benefit, we may think of adding an option for switching the algorithm. If not, it seems the Actual implementation is the best choice, and we may just add some guidelines to the documentation.

@marcusrehm
Contributor

Hi @javornikolov,

Changing the unprocessedRows list to an iterator could help, but what else besides that do you think we can do? If we have more options, sure, I can try to measure the performance.

Thinking about the ordered-inputs approach, we could have two situations. The first one is when both record sets are ordered; iterating over the second one would be just calling unprocessedRows.next to compare rows. The other situation is when only the second record set is ordered; in this case we need to iterate over all unprocessed rows until the n+1 key group to know whether the record exists in the second record set, and then we can abort the search.

In my opinion we could take the first situation, where both record sets are ordered. If we go this way, we need to adapt the algorithm to get the first unprocessed row and return it, like in RowSetFixture. It would end up having no search for the record at all, as it would return the next unprocessed row, which in the passing case should be the one to match.

I will try to make the changes this weekend and post the results here, OK?

@javornikolov
Contributor

Hi @marcusrehm,

Changing the unprocessedRows list to an iterator could help, but what else besides that do you think we can do?

Well, I think this is the main thing to try. And the focus is optimizing the case when both sets match.

In my opinion we could take the first situation where both record sets are ordered.

Yes.

I will try to make the changes this weekend and post the results here, OK?

OK, great!

@marcusrehm
Contributor

Hi @javornikolov,

Working on the unprocessedRows iterator, I realized that if we change it to an iterator, then when MatchableDataTable.markProcessed is called and a DataRow is removed from the collection, the iterator will lose some rows, because the indexes in the collection change after a row is removed.

Another point about unprocessedRows: as it is created as new LinkedList<DataRow>(dt.getRows()), it stores only references to, not copies of, the DataRow objects. Before looking at it I thought it was copying all the DataRow objects into a new collection. As it is right now, I don't think it is costly for the algorithm.

Given the issues above, I think the best option is to leave it as it is. What do you think?
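The pitfall described above can be shown in miniature (a hypothetical illustration, not MatchableDataTable itself): removing elements while walking a list by index silently skips rows, whereas Iterator.remove keeps the traversal consistent. Either way, the LinkedList constructor makes only a shallow copy holding the same element references.

```java
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Minimal illustration of losing rows when removing during iteration.
public class RemoveWhileIterating {

    // Buggy pattern: index-based walk with removal visits only half the rows,
    // because each remove shifts the next element into the slot just visited.
    public static int visitedWithIndexRemoval(List<String> rows) {
        List<String> copy = new LinkedList<>(rows); // shallow copy: same element references
        int visited = 0;
        for (int i = 0; i < copy.size(); i++) {
            visited++;
            copy.remove(i); // the element after the removed one is skipped
        }
        return visited;
    }

    // Safe pattern: Iterator.remove keeps the traversal consistent.
    public static int visitedWithIteratorRemoval(List<String> rows) {
        List<String> copy = new LinkedList<>(rows);
        int visited = 0;
        for (Iterator<String> it = copy.iterator(); it.hasNext(); ) {
            it.next();
            visited++;
            it.remove();
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("r1", "r2", "r3", "r4");
        System.out.println(visitedWithIndexRemoval(rows));    // 2: half the rows were skipped
        System.out.println(visitedWithIteratorRemoval(rows)); // 4
    }
}
```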

About the algorithm to find and compare as we talked earlier:

In my opinion we could take the first situation, where both record sets are ordered. If we go this way, we need to adapt the algorithm to get the first unprocessed row and return it, like in RowSetFixture.

Looking at it now, I don't think it's a good choice, because when there is an error the results could lead the user to misunderstand the real situation.

I created a simple test in this branch, in AcceptanceTests.CoreTests.CommomSuite, called OrderedQuery. In this test I used the Ordered Query fixture to show the behaviour we were planning to use. Basically, we get errors for rows that actually are in the result set.

I think that for now we should just update the documentation to note that ordering the record sets in the database should improve the performance of the row search. In the meantime, I will try to change the actual algorithm to use a Comparable-interface approach when ordered inputs are passed, so we can stop the search at the n+1 record when no row is found.

Do you agree with that?

@javornikolov
Contributor

Hi @marcusrehm ,

Working on the unprocessedRows iterator, I realized that if we change it to an iterator, then when MatchableDataTable.markProcessed is called and a DataRow is removed from the collection, the iterator will lose some rows, because the indexes in the collection change after a row is removed.

Another point about unprocessedRows: as it is created as new LinkedList<DataRow>(dt.getRows()), it stores only references to, not copies of, the DataRow objects. Before looking at it I thought it was copying all the DataRow objects into a new collection. As it is right now, I don't think it is costly for the algorithm.

The idea was to get rid of unprocessedRows entirely, so there would be no need to remove rows via markProcessed, just to advance the iterator. But as you observed, this is indeed just a shallow copy of the collection, so such an optimization probably won't bring any dramatic difference.

I think that for now we should just update the documentation to note that ordering the record sets in the database should improve the performance of the row search.

OK

In the meantime, I will try to change the actual algorithm to use a Comparable-interface approach when ordered inputs are passed, so we can stop the search at the n+1 record when no row is found.

Well, this optimization would only benefit the scenarios with mismatches. So the question is: would it be worth the effort? But you may try it and see if there is a valid use case where it's beneficial.

@marcusrehm
Contributor

Hi @javornikolov,

Actually, I started something on it, but as you said (and we have already agreed), it will only benefit scenarios where we have missing rows; not even the surplus ones are covered by it.

Updating the documentation seems to be the only thing left.

@javornikolov
Contributor

@marcusrehm, thanks a lot for helping to move this forward!

@marcusrehm
Contributor

Thank you @javornikolov and @benilovj for the great job done with DbFit!
👍

@benilovj benilovj added this to the 3.0.0 milestone May 4, 2014