Skip to content
This repository has been archived by the owner on Jun 29, 2021. It is now read-only.

Feature/faster join #148

Closed

Conversation

tomatophantastico
Copy link
Contributor

A simple nested loop join implementation to deal with METAMODEL-1144.

This is an okay-ish solution:
it maintains the results of the joins, but wont scale too well
no join ordering or similar

@kaspersorensen
Copy link
Contributor

I can't see how this is faster than the previous solution? To me it looks like a refactor of it, so it must be same speed.

I mean, the refactor may have some merit, but just trying to ensure if I get it right. Is there something about this change that would make it faster?

@tomatophantastico
Copy link
Contributor Author

Actually this should be slower than the previous solution as i removed the join ordering and changed the way the filter/ join condition is evaluated.

Both made some assumptions i was not sure if they are right, so i try to go a safer route instead.

On the other hand, this implementation plays nice with the existing testcases. The existing testcases check for string equivalence of DataSet.toString().

@kaspersorensen
Copy link
Contributor

So we should close this PR then, if it's going to be slower?

@tomatophantastico
Copy link
Contributor Author

I do not think the changes i proposed about a year ago were merged, as they broke some tests. Or did i miss sth here?

The proposed change is still ~10 times faster, depending on the dataset size.

@kaspersorensen
Copy link
Contributor

Sorry, I can't remember your previous change :-) What I am asking for is if you can point out why this is faster. I'm maybe just not seeing it.

@tomatophantastico
Copy link
Contributor Author

The previous method materializes the cartesian product and then filters it.
The proposed changes applies the filters in the during the joining.

@kaspersorensen
Copy link
Contributor

Ah yes you're right. That should give a good boost in performance in deed!

Functionally the code looks good to me. I am not 100% sure about what I think about the JoinHelper class. It's only made up of static methods which means that all it's methods could just as well be directly part of MetaModelHelper. And MetaModelHelper is IMO already a kind of a junkdrawer of static methods, so I don't think I want another :-) On the other hand, something that would maybe be nice is if you made the JoinHelper class non-static so that you would instantiate it like this:

DataSetJoiner joiner = new DataSetJoiner(fromItemWithJoin);
DataSet dataSet = joiner.join(dataSets);

(or something quite similar to this)

I think then we would have a class that's quite nicely built and scoped such that it cannot be used out of context (statically). So that would be my suggestion for a more pretty design :-)

@LosD
Copy link
Contributor

LosD commented Jul 24, 2017

My 2 cents regarding helpers (without having looked at the actual code): While any kind of utility or helper class is an anti pattern, if the JoinHelper centers around the same subject and MetaModelHelper does not, it's much preferable to keep them separate. IMHO, we'd want to remove methods from MetaModelHelper (to the point it ceases to be at some point), not add to it.

@tomatophantastico
Copy link
Contributor Author

Imho, would not refer to utility classs in general as an anti-pattern, you really have to look at it use-case dependent.

In this case however, i agree that Metamodelhelper should be replaced.

MetamodelHelper appears to do two things:
(a)
Execute some relation operation on the QueryPostprocessDataContext (order/group/join/...)
(b)
Deal with the usage of array/List/Iterable in apis

I suggest for clearing things up, i move (a) first in a separate static class. From there on more profound ways for executing those operations can be defined.

For dealing with (b): Some of them are not used in MM anyway, so they might be as well deleted in MM 6.x.x, may deprecate them in MM 5.x.x?
For the other, i would really prefer to clean up the api and switch to Collection/Lists/Sets and Streams instead of arrays and iterables instead of those conversion/filtering methods.
This is also rather big, so MM 6.x.x

@tomatophantastico
Copy link
Contributor Author

Ok, this is literally meddling with the core of Metamodel.

I'll refactor this pull request, such that is as minimally invasive as possible: The API doc also guarantees MetaModelHelper to be stable.

For more experimental changes, i'll create an other pull request.

@kaspersorensen
Copy link
Contributor

The point about using collections instead of arrays is a good old wishlist item :-)
https://issues.apache.org/jira/browse/METAMODEL-7

If you're up for it, I am 100% behind it. But it's a LOT of work I think.

assertEquals(9, results.size());
assertTrue(results.contains("Row[values=[f, b]]"));
assertTrue(results.contains("Row[values=[f, a]]"));
assertTrue(results.contains("Row[values=[o, r]]"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we're not asserting on order anymore, we should probably assert on the existance of all 9 records. Otherwise we've somehow easened the assertion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true.

There are a lot of Arrays.toString() comparisons in other test, which should not be there.

@kaspersorensen
Copy link
Contributor

This PR looks very good to me. I'm gonna go ahead and format the files slightly and then merge it.

@asfgit asfgit closed this in ddfa13e Jul 27, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants