-
Notifications
You must be signed in to change notification settings - Fork 389
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Join type promotion #1218
Join type promotion #1218
Conversation
af6a0a6
to
6b5f4af
Compare
on_left = self.on_left | ||
on_right = self.on_right | ||
|
||
right_params = self.rhs.schema[0].parameters[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can do this with self.rhs.measure.fields
, which IMO is a bit more clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there doesn't always appear to be a measure attribute on this type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm weird, is that the cause of the build failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure, I just installed pyspark to run this locally; however, I am running into new issues when trying to extend this to handle the join of int32 and int64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can ping you when I get something working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I'd install spark 1.3 for now, because I haven't worked up a PR to add support for 1.4 yet.
[name, extract_option(dt) | ||
if isinstance(dt, Option) and | ||
not isinstance(right_params[n], Option) else dt] | ||
for n, (name, dt) in enumerate(self.lhs.schema[0].parameters[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, I think this should be self.lhs.measure.fields
70cd992
to
c083b79
Compare
The build is going to fail until conda gets updated with the datashape pr. Getting 1 error locally where the order of columns are flipped. edit: After walking through the test manaully, I believe that it was making an incorrect assertion. I updated the test and added a comment with some intermediate step to make it easier for people to validate this later. |
Should be ready to go now. |
('A', 1, 5), | ||
('F', 6, 1), | ||
('F', 6, 2), | ||
('F', 6, 4)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this just wrong before?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see your comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is the line that fixes the column ordering here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually did not intend on fixing this, it sort of fell out of the other changes. That is why I walked through the join manually to assure myself I still had the correct answer. I think what fixed this were the checks on collections.py:446 and 450. This is because we were doing an inclusion check on a field name against a string and 'a' is in 'name' but should not have been selected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah makes sense
d0397d1
to
6894ac4
Compare
f8b46df
to
7171cf3
Compare
if name not in self.on_right] | ||
right_types = keymap( | ||
dict(zip(on_right, on_left)).get, | ||
dict(self.rhs.schema[0].parameters[0]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was it not possible to use self.rhs.measure
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, something didn't have a measure attribute, not sure what the types were here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just tried this, you can do self.rhs.dshape.measure.dict
and remove the dict()
call around it
# [3, C, 5], | ||
# [6, F, 1], | ||
# [6, F, 2], | ||
# [6, F, 4]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the joined key all the way over to the left? is that just how the join dshape pops out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we construct the pairs we emit joined + left + right
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I the shape of the measure in the join puts the joined keys first
right_types = listpack(types_of_fields(on_right, rhs)) | ||
if len(left_types) != len(right_types): | ||
raise ValueError( | ||
'Length of on_left=%d not equal to lenght of on_right=%d' % ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small typo lenght
should be length
closes #1193
edit: this pr also does generic type promotion of joined fields.
needs blaze/datashape#172
This also fixes an issue that could occur when joining on a single column name passed as a string when other column names were substrings of the join column.