More efficient comparison of large resultsets #10
Hi Jake, is there a plan to develop this feature in the near future? You guys have done great work with this tool! Congrats! Best Regards,
Hi Markus, unfortunately, probably not in the near future. I'd of course accept pull requests and would be happy to give pointers about how this could be implemented... Regards,
Ok Jake! So I'm going to fork the project and try to get the VM up for a test environment. If I have any questions about that, where can I ask for help? You said that you could help by pointing out how this could be implemented; I already had a look at the CompareStoredQueries class and saw that the algorithm is O(n²), is that right? Regards,
Probably the forum is the best place - that way, either I or @javornikolov can jump in and help, and it's there for the benefit of others too. Alternatively, if you find bugs in the setup/documentation, a GitHub issue/pull request is probably the most appropriate.
Sadly, yes, that's the fundamental problem. Another issue is not having a way to suppress successfully matched rows. I'll have a look at the code and add to this issue if I can think of more things that might be needed.
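For context, the quadratic behaviour comes from matching every expected row against every remaining actual row. A minimal illustrative sketch of that pattern (not DbFit's actual code; rows are simplified to plain strings here):

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveCompare {
    // Returns the rows of `expected` that have no equal counterpart in `actual`.
    // Each expected row triggers a linear scan of the remaining actual rows,
    // so the total work is O(n * m) comparisons.
    static List<String> unmatched(List<String> expected, List<String> actual) {
        List<String> remaining = new ArrayList<>(actual);
        List<String> missing = new ArrayList<>();
        for (String row : expected) {
            if (!remaining.remove(row)) { // linear scan per expected row
                missing.add(row);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(unmatched(List.of("a", "b", "c"), List.of("c", "a"))); // prints [b]
    }
}
```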
I created a pull request for the "suppress successfully matched rows".
@javornikolov / @benilovj
@marcusrehm, I think the hiding of matched rows is good enough for now. Some thoughts on things which may have an impact here: …
One idea here which would be easy to try: the current algorithm + having …
Yes, I'm already using the …
What I can think of right now is that it would be good if we could have a partial match of resultsets, where we could set the test to pass (green) while ignoring missing rows from query 2, for example. This would be nice if you think of cases like ETL loads from Stage to ODS or DW. What do you think about it?
About resultset size, I was thinking of DW processes, so a good starting point would be around 100,000 rows.
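The HashMap-based indexing discussed in this thread could look roughly like the sketch below: build a map from row to occurrence count for one result set, then probe it with the other, giving roughly O(n + m) expected time instead of O(n²). This is a hypothetical illustration, not the DataRowIndexer implementation; rows are assumed to be plain strings:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashIndexCompare {
    // Counts how many rows of `expected` are missing from `actual`,
    // using a multiset index: O(n + m) expected time.
    static int missingCount(List<String> expected, List<String> actual) {
        Map<String, Integer> index = new HashMap<>();
        for (String row : actual) {
            index.merge(row, 1, Integer::sum); // build the index once
        }
        int missing = 0;
        for (String row : expected) {
            Integer left = index.get(row);
            if (left == null || left == 0) {
                missing++;                 // no unmatched copy remains
            } else {
                index.put(row, left - 1);  // consume one match
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(missingCount(List.of("a", "b", "b"), List.of("b", "a"))); // prints 1
    }
}
```

The trade-off raised later in the thread applies here: for the tiny result sets typical of most DbFit tests, the constant cost of building the index can outweigh the asymptotic win.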
Hi @benilovj / @javornikolov, as posted on "Decouple results matching logic of CompareStoredQueries from FitNesse", I've been working with the algorithms, and for this I created a DataRowIndexer class; it's in my compare-stored-queries-as-matcher branch (marcusrehm@176c958).
I think we can stay with it for now and work on the remodeling in #213. For now I think that a Factory pattern to load the Indexer would be interesting, so we can instantiate it in CompareStoredQueries. Another point that @javornikolov asked about was running these tests with the …
A slightly more elaborate experiment for the …: basically, this is a sort-merge join (http://sybaseblog.com/2011/01/28/joins-algorithms/), which is O(N+M) plus the time for sorting. The trick here is that the backend may be able to sort quite fast.
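The sort-merge idea can be sketched as a single lockstep pass over two already-sorted inputs (an illustration only, assuming rows compare as plain strings; sorting itself is left to the database):

```java
import java.util.List;

public class SortMergeCompare {
    // Walks two sorted lists in lockstep and counts rows present only in
    // `left` and only in `right`: one pass, O(N + M) after sorting.
    static int[] diffCounts(List<String> left, List<String> right) {
        int i = 0, j = 0, onlyLeft = 0, onlyRight = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i).compareTo(right.get(j));
            if (cmp == 0) { i++; j++; }             // match: advance both sides
            else if (cmp < 0) { onlyLeft++; i++; }  // left row has no partner
            else { onlyRight++; j++; }              // right row has no partner
        }
        onlyLeft += left.size() - i;   // leftover unmatched rows
        onlyRight += right.size() - j;
        return new int[] { onlyLeft, onlyRight };
    }

    public static void main(String[] args) {
        int[] d = diffCounts(List.of("a", "b", "d"), List.of("b", "c", "d"));
        System.out.println(d[0] + " " + d[1]); // prints 1 1
    }
}
```

Note the correctness caveat raised below: this only works if both inputs really are sorted by the same key, which is why the thread later debates whether guaranteeing that is worth the effort.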
Hi @javornikolov / @benilovj, I'm developing the matching-algorithms abstraction and I have a question about … About the tests I was doing, I think finishing this will make them easier to run, because I will be able to choose the algorithm with a …
Yes, currently …
Question: do we really need the merging algorithm to be configurable? Why would we want to have more than 1?
The way I see it, the initial idea is to experiment and compare the algorithms we have. For @marcusrehm it seems such a switch may be helpful for some of these experiments. I'm not sure yet about exposing such an option to end users. But in general, which algorithm is best usually depends on the amount of data on both sides, the amount of mismatches, and the distribution and ordering of the data.
@benilovj, for now it would help a lot in testing different algorithm implementations, and as @javornikolov said, in some cases end users can benefit from the option to change the algorithm. I had a case, for example, where an acceptance test takes ~1:30 minutes to finish; now imagine if we can tune the tests by choosing the best algorithm, it could save some time across a whole set of tests.
What are those cases? I don't see them at the moment; I just see the overhead of maintaining more complex code and more complex documentation.
As I've said before, I believe that prod tests in DbFit are a bad idea, and I don't want to encourage users to tinker with comparison algorithms; that time is better spent improving the test itself. In any case, any switcher won't be part of the first implementation of the feature; let's start with one before considering moving on to multiple.
I think the typical case with DbFit should be tests on top of very small amounts of data (though the number of tests themselves may be large). For such scenarios, from the measurements which @marcusrehm has performed, the existing implementation works a bit faster than the new ones using HashMap and TreeMap. My initial bias is towards not harming the typical case in favour of cases with large data sets. One idea is what I suggested as an additional approach: something like …
@benilovj, I see your point. That is not what DbFit was created for and I understand that perfectly. Another point I agree with you on is:
I am not talking about using DbFit in production, nor for performance tests. I totally agree with you on this, but I think the issue about the amount of data is related to the nature of what we are testing. If we are testing transactional applications, maybe two or five thousand rows could be a good upper-limit case, but for data warehouse applications this amount of rows would not reflect a subset of a whole case that the application should deal with. We can try it this way: I will make the change in the Options class just to run the tests in my environment, and once we decide which algorithm should stay I can revert this modification and commit the main changes. What do you guys think?
This was my initial understanding. Once we have more data collected from the experiments, we can judge better whether a single algorithm is good enough for all the scenarios we want to cover.
I don't think it's something specific to DW systems. What would prevent you from splitting that into multiple smaller tests working on different smaller subsets? The same may apply in the OLTP world too.
@javornikolov beautifully put, my thoughts exactly.
Guys, I think we are defending the same point of view; maybe, as my English is not so good, I didn't express myself in the best way.
Yes, and that is true, nothing changes.
This is exactly what I meant. Our job with DbFit consists of 2 tasks: …
@marcusrehm, now that we're done with #213, I wonder how it compares performance-wise to the state before the redesign (with the default and with the alternative indexing strategies). Would you be able to run some tests to measure that? Then we can go on to choose the best default algorithm and decide whether it makes sense to have the ability to switch it.
Hi @javornikolov / @benilovj, I did the tests again with the new implementation, and what I can conclude is that the current algorithm has better results when ordered resultsets are used. But with unordered resultsets, the HashMap gets better performance. Maybe if we could ensure that rows are sorted at some point before the comparison, we could guarantee the observed behaviour. Also, using an ordered list we could improve a row search by stopping it right after the algorithm reaches the n+1 key group. Any thoughts?
@marcusrehm, thanks for the measurements. Why do we have different numbers of right/wrong in some of the cases?
I've also been thinking about similar optimizations. The thing here is that we would need to guarantee sorted resultsets, otherwise the outcome will be incorrect. That is possible, but I just wonder if it's worth the effort, since this will only improve the cases where we have failures. In general I think it's more important to have the tests executing quickly when they're passing. The failures don't seem so critical to me in terms of performance; ideally they shouldn't stay red for too long. What do you think?
Some ideas if we end up with something which relies on having sorted incoming rows: …
And a somewhat different approach, which however may need several database-specific implementations: if we have two queries, we can generate a query which compares the results in the database.
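One way this idea could look, sketched as a hypothetical query generator (names and structure are my assumptions, not DbFit code): wrap the two input queries in a symmetric-difference query so the database returns only the mismatching rows. The ANSI `EXCEPT` operator is used here; Oracle spells it `MINUS`, which is exactly why database-specific implementations would be needed.

```java
public class InDatabaseDiff {
    // Builds a symmetric-difference query from two input queries, so the
    // database does the comparison and returns only mismatching rows.
    // Uses ANSI EXCEPT; Oracle would need MINUS instead.
    static String diffQuery(String query1, String query2) {
        return "(" + query1 + " EXCEPT " + query2 + ") "
             + "UNION ALL "
             + "(" + query2 + " EXCEPT " + query1 + ")";
    }

    public static void main(String[] args) {
        System.out.println(diffQuery("SELECT id FROM stage", "SELECT id FROM dw"));
    }
}
```

An empty result would mean the two result sets match (up to duplicate-row counting, which `EXCEPT` handles with set rather than bag semantics); the sorting and matching work all stays on the database side.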
@javornikolov, I ran the tests across two different days, so the numbers of records were different. I ran them again and the results are below:
I agree with you; I need to focus on the passing tests, since the failing ones have a short lifetime. About the options we have, I think that creating several database implementations would be difficult to maintain and isn't worth it.
This is the best approach to me. Actually, I saw that …
I'm curious what the performance would be if we tried an optimized version relying on ordered inputs. One thing in my mind (looking at …): @marcusrehm, what do you think? Would you be able to measure the performance of something like that? If it brings a significant benefit, we may think of adding an option for switching the algorithm. If not, it seems the current implementation is the best choice and we may just add some guidelines to the documentation.
Hi @javornikolov, changing the unprocessedRows list to an iterator could help, but what else do you think we can do besides that? If we have more options, sure, I can try to measure performance. Thinking about the ordered-inputs approach, we could have two situations. The first is when both record sets are ordered, and iterating over the second one would be just calling unprocessedRows.next to compare rows. The other is when only the second record set is ordered; in this case we need to iterate over all unprocessed rows until the n+1 key group to know whether the record exists in the second record set, and then we can abort the search. In my opinion we should take the first situation, where both record sets are ordered. If we go this way, we need to adapt the algorithm to get the first unprocessed row and then return it like in … I will try to do the changes this weekend and post the results here, OK?
Hi @marcusrehm,
Well, I think this is the main thing to try. And the focus is optimizing the case when both sets match.
Yes.
OK, great!
Hi @javornikolov, working on the … Another point at … With the issues above, I think the best option is to leave it as it is. What do you think? About the algorithm to find and compare, as we talked about earlier:
Looking at it now, I don't think it's a good choice, because when there is an error the results could lead the user to misunderstand the real situation. I created a simple test in this branch on … I think that for now we should just update the documentation to point out that ordering record sets in the database should improve the performance of the row search. In the meantime, I will try to change the current algorithm to use a … Do you agree with that?
Hi @marcusrehm,
The idea was to get rid of unprocessedRows, and in this way there would be no need to remove via …
OK
Well, this optimization would only benefit the scenarios with mismatches. So the question is: would it be worth the effort? But you may try it and see if there is a valid use case where it's beneficial.
Hi @javornikolov, actually I started something on it, but as you said and we already agreed, it will only benefit scenarios where we have missing rows; not even the surplus ones fall into those scenarios. Updating the documentation seems to be the only thing left.
@marcusrehm, thanks a lot for helping to move this forward!
Thank you @javornikolov and @benilovj for the great job done with DbFit!
CompareStoredQueries
more efficient than O(n²): solved in the case of comparing sorted sets