Join type promotion #1218

Merged
merged 15 commits into from
Sep 4, 2015

llllllllll
Member

closes #1193

edit: this pr also does generic type promotion of joined fields.

needs blaze/datashape#172

This also fixes an issue that could occur when joining on a single column name passed as a string when other column names were substrings of the join column.

on_left = self.on_left
on_right = self.on_right

right_params = self.rhs.schema[0].parameters[0]
Member

you can do this with self.rhs.measure.fields, which IMO is a bit more clear

Member Author

there doesn't always appear to be a measure attribute on this type.

Member

hm weird, is that the cause of the build failure?

Member Author

not sure, I just installed pyspark to run this locally; however, I am running into new issues when trying to extend this to handle the join of int32 and int64.

Member Author

I can ping you when I get something working.

Member

Ok. I'd install spark 1.3 for now, because I haven't worked up a PR to add support for 1.4 yet.

@llllllllll added the wip label Aug 27, 2015
[name, extract_option(dt)
if isinstance(dt, Option) and
not isinstance(right_params[n], Option) else dt]
for n, (name, dt) in enumerate(self.lhs.schema[0].parameters[0])
Member

same here, I think this should be self.lhs.measure.fields
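
For readers following the snippet above, here is a minimal sketch of the rule it appears to encode: when a join column is optional (nullable) on only one side, the joined field can use the non-optional type. The sketch models types as plain strings in datashape's `?int64` notation instead of real datashape objects. `is_option`, `strip_option`, and `joined_key_type` are hypothetical helper names, not blaze or datashape APIs, and the integer widening step only illustrates the int32/int64 case mentioned earlier in the thread.

```python
# Illustrative stand-in only: types are strings such as 'int32' or '?int64',
# where a leading '?' marks an optional (nullable) type, as in datashape notation.
# None of these helpers are part of blaze or datashape.

INT_WIDTH = {'int8': 8, 'int16': 16, 'int32': 32, 'int64': 64}


def is_option(t):
    return t.startswith('?')


def strip_option(t):
    return t[1:] if is_option(t) else t


def joined_key_type(left, right):
    """Pick a type for a join key from the two input column types."""
    lbase, rbase = strip_option(left), strip_option(right)
    if lbase == rbase:
        base = lbase
    elif lbase in INT_WIDTH and rbase in INT_WIDTH:
        # Widen to the larger of the two integer types.
        base = lbase if INT_WIDTH[lbase] >= INT_WIDTH[rbase] else rbase
    else:
        raise TypeError('cannot join %s with %s' % (left, right))
    # Assumption (inner join): the key stays optional only if both sides are
    # optional, because a null cannot match a value in a non-null column.
    optional = is_option(left) and is_option(right)
    return '?' + base if optional else base


print(joined_key_type('?int32', 'int32'))   # int32
print(joined_key_type('int32', 'int64'))    # int64
print(joined_key_type('?int32', '?int64'))  # ?int64
```

The real implementation works on datashape types, so treat this only as a way to reason about the expected result type of a joined column.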

@llllllllll
Member Author

The build is going to fail until conda gets updated with the datashape PR. I'm getting 1 error locally where the order of the columns is flipped: test_graph_double_join in the Python compute tests. Not sure what is causing this. If you have a chance, some extra eyes would be appreciated @cpcloud.

edit: After walking through the test manually, I believe that it was making an incorrect assertion. I updated the test and added a comment with some intermediate steps to make it easier for people to validate this later.

@llllllllll changed the title from "Join option types" to "Join type promotion" Aug 27, 2015
@llllllllll removed the wip label Aug 27, 2015
@llllllllll
Member Author

Should be ready to go now.

('A', 1, 5),
('F', 6, 1),
('F', 6, 2),
('F', 6, 4)])
Member

was this just wrong before?

Member

ah i see your comment

Member

where is the line that fixes the column ordering here?

Member Author

I actually did not intend to fix this; it sort of fell out of the other changes. That is why I walked through the join manually to assure myself I still had the correct answer. I think what fixed this were the checks at collections.py:446 and 450: we were doing an inclusion check on a field name against a string, and 'a' is in 'name' but should not have been selected.
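
The pitfall described here is plain Python behaviour: `in` against a string is a substring test, while `in` against a list is an element test. The sketch below reproduces it with hypothetical helper names (`select_fields_buggy`, `select_fields_fixed`) and made-up field names. It is not the code from collections.py.

```python
# Hypothetical helpers; they only demonstrate the membership-test behaviour.

def select_fields_buggy(fields, on):
    # If `on` is a plain string, `in` performs a substring test.
    return [f for f in fields if f in on]


def select_fields_fixed(fields, on):
    # Normalise a single column name to a one-element list first,
    # so that `in` compares whole names.
    if isinstance(on, str):
        on = [on]
    return [f for f in fields if f in on]


fields = ['a', 'name', 'amount']
print(select_fields_buggy(fields, 'name'))  # ['a', 'name']
print(select_fields_fixed(fields, 'name'))  # ['name']
```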

Member

Ah makes sense

@cpcloud added the bug label Aug 27, 2015
@cpcloud added this to the 0.8.3 milestone Aug 27, 2015
@llllllllll force-pushed the join-option-types branch 2 times, most recently from f8b46df to 7171cf3, September 1, 2015 18:05
if name not in self.on_right]
right_types = keymap(
    dict(zip(on_right, on_left)).get,
    dict(self.rhs.schema[0].parameters[0]),
Member

was it not possible to use self.rhs.measure here?

Member Author

no, something didn't have a measure attribute, not sure what the types were here

Member

just tried this, you can do self.rhs.dshape.measure.dict and remove the dict() call around it
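
To make the `keymap` call in the snippet under discussion easier to follow, here is a stdlib-only sketch of the renaming step. It assumes that `keymap` is the toolz function of that name, which applies a function to every key of a dict. The column names and the string-typed field dictionary are made up for illustration.

```python
# Stdlib-only stand-in for keymap(func, d): apply func to every key of d.
def keymap(func, d):
    return {func(k): v for k, v in d.items()}


on_left = ['id']
on_right = ['account_id']  # hypothetical column names
right_fields = {'account_id': 'int64', 'balance': 'float64'}

# Map each right-hand join key to the matching left-hand name.
rename = dict(zip(on_right, on_left))  # {'account_id': 'id'}
right_types = keymap(rename.get, right_fields)

print(right_types['id'])  # int64
```

In this sketch, fields that are not join keys end up under the key `None`, because `dict.get` returns `None` for them, so only the join-key entries are meaningful.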

# [3, C, 5],
# [6, F, 1],
# [6, F, 2],
# [6, F, 4]]
Member

why is the joined key all the way over to the left? is that just how the join dshape pops out?

Member Author

when we construct the pairs we emit joined + left + right

Member Author

Also, the shape of the measure in the join puts the joined keys first
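
The "joined + left + right" ordering can be sketched as follows. `joined_field_order` is a hypothetical helper with made-up field names. It only illustrates why the join key ends up in the leftmost column and is not the blaze implementation.

```python
def joined_field_order(left_fields, right_fields, on_left, on_right):
    """Order output columns as: join keys, other left fields, other right fields."""
    joined = [f for f in left_fields if f in on_left]
    left = [f for f in left_fields if f not in on_left]
    right = [f for f in right_fields if f not in on_right]
    return joined + left + right


# The key 'id' is second in the left table but first in the result.
print(joined_field_order(['name', 'id'], ['id', 'amount'], ['id'], ['id']))
# ['id', 'name', 'amount']
```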

right_types = listpack(types_of_fields(on_right, rhs))
if len(left_types) != len(right_types):
    raise ValueError(
        'Length of on_left=%d not equal to lenght of on_right=%d' % (
Member

small typo: lenght should be length
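
A self-contained sketch of the length check, with the message spelled correctly, might look like the following. `listpack` here is a hypothetical stand-in written for this example (the real helper may behave differently), and `check_join_keys` is an invented name.

```python
def listpack(x):
    """Hypothetical stand-in: return a list, wrapping a single value if needed."""
    if isinstance(x, (list, tuple)):
        return list(x)
    return [x]


def check_join_keys(on_left, on_right):
    """Normalise both key specs to lists and require equal lengths."""
    on_left, on_right = listpack(on_left), listpack(on_right)
    if len(on_left) != len(on_right):
        raise ValueError(
            'Length of on_left=%d not equal to length of on_right=%d'
            % (len(on_left), len(on_right)))
    return on_left, on_right


print(check_join_keys('id', 'account_id'))  # (['id'], ['account_id'])
```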

llllllllll added a commit that referenced this pull request Sep 4, 2015
@llllllllll merged commit 1f10384 into blaze:master Sep 4, 2015
@llllllllll deleted the join-option-types branch September 4, 2015 18:19
Successfully merging this pull request may close these issues.

Failing joins on field that is nullable in one table but not in the other
2 participants