Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster joins #38

Closed
antonmks opened this issue Jul 4, 2013 · 10 comments
Closed

Faster joins #38

antonmks opened this issue Jul 4, 2013 · 10 comments

Comments

@antonmks
Copy link
Owner

antonmks commented Jul 4, 2013

I added an option to sort database segments locally on a join key. The best performance can be reached by sorting the data file on a filter key and then sorting segments locally on a join key. This way the expensive sorting of the datasets can be avoided in join operations.
So, for example, the line
STORE A INTO 'lineitem' BINARY;
now becomes
STORE A INTO 'lineitem' BINARY SORT SEGMENTS BY orderkey;

Running time of Q3 changed from 15 to 7 seconds,
Q5 from 68 to 17 seconds.

@AlexeyAB
Copy link
Contributor

AlexeyAB commented Jul 4, 2013

Good news! It sorts the data in every segment seperately, and can be used an additional to "sort -t "|" -k11 lineitem.tbl > lineitems.tbl" to sort all data by field "shipdate" which reduces time for FILTER?

And by using thrust::is_sorted() before join we can know, do we need to sort data or no, is it correct?

@antonmks
Copy link
Owner Author

antonmks commented Jul 4, 2013

Yes.
It creates a file *.sort where it stores the indexes of sorted columns. So
when doing a join it checks if a segment is already sorted by looking into
the file.
Using thrust::is_sorted probably would be more expensive, it would have to
check all data in a column.

Anton

Подольск ?

On Thu, Jul 4, 2013 at 3:44 PM, AlexeyAB notifications@github.com wrote:

Good news! It sorts the data in every segment seperately, and can be used
an additional to "sort -t "|" -k11 lineitem.tbl > lineitems.tbl" to sort
all data by field "shipdate" which reduces time for FILTER?

And by using thrust::is_sorted() before join we can know, do we need to
sort data or no, is it correct?


Reply to this email directly or view it on GitHubhttps://github.com//issues/38#issuecomment-20475062
.

@AlexeyAB
Copy link
Contributor

AlexeyAB commented Jul 4, 2013

Stored in the "*.sort-file" indexes, which uses with "thrust::gather/scatter" to produce sorted segment?

Yes, and you, Moscow?

@antonmks
Copy link
Owner Author

antonmks commented Jul 4, 2013

The sort file simply has one or more numbers like 0 - meaning that a segment is already presorted on first field.
03 - presorted on first and third fields.

Yes.

@AlexeyAB
Copy link
Contributor

AlexeyAB commented Jul 4, 2013

.sort is a metadata-file. Why not to use bitmask, an example 5(101) mean that on 1st and 3d fields is already sorted?

А как догадались, что с Подольска? :)

@antonmks
Copy link
Owner Author

antonmks commented Jul 4, 2013

Yes, it is possible to use a bitmask, although the effect on performance would be nil. It is just a list of sorted fields.

Посмотрел Google analytics, там есть города посетителей. А почему Подольск ? С такими интересами в Сан Франциско было бы интереснее и лучше :-)

@AlexeyAB
Copy link
Contributor

AlexeyAB commented Jul 4, 2013

I can't read it :)

@antonmks
Copy link
Owner Author

antonmks commented Jul 4, 2013

Updated.

@AlexeyAB
Copy link
Contributor

AlexeyAB commented Jul 4, 2013

Вариант. Но не так хорош разговорный английский, и сейчас не так далеко от Москвы, пока смотрю найдется ли здесь достойное применение. А сами не думали о Сан Франциско? :)

@antonmks
Copy link
Owner Author

antonmks commented Jul 5, 2013

Английский - ерунда, учится быстро. А мне никак по семейным обстоятельствам :-(

@antonmks antonmks closed this as completed Jul 5, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants