New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join type promotion #1218

Merged
merged 15 commits into from Sep 4, 2015

Conversation

Projects
None yet
2 participants
@llllllllll
Member

llllllllll commented Aug 27, 2015

closes #1193

edit: this pr also does generic type promotion of joined fields.

needs blaze/datashape#172

This also fixes an issue that could occur when joining on a single column name passed as a string when other column names were substrings of the join column.

@llllllllll llllllllll force-pushed the quantopian:join-option-types branch from af6a0a6 to 6b5f4af Aug 27, 2015

on_left = self.on_left
on_right = self.on_right
right_params = self.rhs.schema[0].parameters[0]

This comment has been minimized.

@cpcloud

cpcloud Aug 27, 2015

Member

you can do this with self.rhs.measure.fields, which IMO is a bit more clear

This comment has been minimized.

@llllllllll

llllllllll Aug 27, 2015

Member

there doesn't always appear to be a measure attribute on this type.

This comment has been minimized.

@cpcloud

cpcloud Aug 27, 2015

Member

hm weird, is that the cause of the build failure?

This comment has been minimized.

@llllllllll

llllllllll Aug 27, 2015

Member

not sure, I just installed pyspark to run this locally; however, I am running into new issues when trying to extend this to handle the join of int32 and int64.

This comment has been minimized.

@llllllllll

llllllllll Aug 27, 2015

Member

I can ping you when I get something working.

This comment has been minimized.

@cpcloud

cpcloud Aug 27, 2015

Member

Ok. I'd install spark 1.3 for now, because I haven't worked up a PR to add support for 1.4 yet.

joined = [
[name, extract_option(dt)
if isinstance(dt, Option) and
not isinstance(right_params[n], Option) else dt]

This comment has been minimized.

@cpcloud

cpcloud Aug 27, 2015

Member

should this be right_params[n][1]?

This comment has been minimized.

@llllllllll

llllllllll Aug 27, 2015

Member

yeah, I am changing this up a bit to use: blaze/datashape#172

@llllllllll llllllllll added the wip label Aug 27, 2015

[name, extract_option(dt)
if isinstance(dt, Option) and
not isinstance(right_params[n], Option) else dt]
for n, (name, dt) in enumerate(self.lhs.schema[0].parameters[0])

This comment has been minimized.

@cpcloud

cpcloud Aug 27, 2015

Member

same here, I think this should be self.lhs.measure.fields

@llllllllll llllllllll force-pushed the quantopian:join-option-types branch from 70cd992 to c083b79 Aug 27, 2015

@llllllllll

This comment has been minimized.

Member

llllllllll commented Aug 27, 2015

The build is going to fail until conda gets updated with the datashape pr. Getting 1 error locally where the order of columns are flipped. test_graph_double_join in the python compute tests. Not sure what is causing this. If you have a chance, some extra eyes would be apreciated @cpcloud.

edit: After walking through the test manaully, I believe that it was making an incorrect assertion. I updated the test and added a comment with some intermediate step to make it easier for people to validate this later.

@llllllllll llllllllll changed the title from Join option types to Join type promotion Aug 27, 2015

@llllllllll llllllllll removed the wip label Aug 27, 2015

@llllllllll

This comment has been minimized.

Member

llllllllll commented Aug 27, 2015

Should be ready to go now.

('A', 1, 5),
('F', 6, 1),
('F', 6, 2),
('F', 6, 4)])

This comment has been minimized.

@cpcloud

cpcloud Aug 27, 2015

Member

was this just wrong before?

This comment has been minimized.

@cpcloud

cpcloud Aug 27, 2015

Member

ah i see your comment

This comment has been minimized.

@cpcloud

cpcloud Sep 3, 2015

Member

where is the line that fixes the column ordering here?

This comment has been minimized.

@llllllllll

llllllllll Sep 3, 2015

Member

I actually did not intend on fixing this, it sort of fell out of the other changes. That is why I walked through the join manually to assure myself I still had the correct answer. I think what fixed this were the checks on collections.py:446 and 450. This is because we were doing an inclusion check on a field name against a string and 'a' is in 'name' but should not have been selected.

This comment has been minimized.

@cpcloud

cpcloud Sep 4, 2015

Member

Ah makes sense

@cpcloud cpcloud added the bug label Aug 27, 2015

@cpcloud cpcloud added this to the 0.8.3 milestone Aug 27, 2015

@llllllllll llllllllll force-pushed the quantopian:join-option-types branch from d0397d1 to 6894ac4 Aug 28, 2015

@llllllllll llllllllll force-pushed the quantopian:join-option-types branch 2 times, most recently from f8b46df to 7171cf3 Sep 1, 2015

if name not in self.on_right]
right_types = keymap(
dict(zip(on_right, on_left)).get,
dict(self.rhs.schema[0].parameters[0]),

This comment has been minimized.

@cpcloud

cpcloud Sep 1, 2015

Member

was it not possible to use self.rhs.measure here?

This comment has been minimized.

@llllllllll

llllllllll Sep 1, 2015

Member

no, something didn't have a measure attribute, not sure what the types were here

This comment has been minimized.

@cpcloud

cpcloud Sep 1, 2015

Member

just tried this, you can do self.rhs.dshape.measure.dict and remove the dict() call around it

(name, promote(dt, right_types[name], promote_option=False))
for n, (name, dt) in enumerate(filter(
compose(op.contains(on_left), first),
self.lhs.schema[0].parameters[0]

This comment has been minimized.

@cpcloud

cpcloud Sep 1, 2015

Member

here you can do self.lhs.dshape.measure.fields

This comment has been minimized.

@cpcloud

cpcloud Sep 1, 2015

Member

these are longer names, but IMO much more readable

# [3, C, 5],
# [6, F, 1],
# [6, F, 2],
# [6, F, 4]]

This comment has been minimized.

@cpcloud

cpcloud Sep 1, 2015

Member

why is the joined key all the way over to the left? is that just how the join dshape pops out?

This comment has been minimized.

@llllllllll

llllllllll Sep 1, 2015

Member

when we construct the pairs we emit joined + left + right

This comment has been minimized.

@llllllllll

llllllllll Sep 1, 2015

Member

Also, I the shape of the measure in the join puts the joined keys first

right_types = listpack(types_of_fields(on_right, rhs))
if len(left_types) != len(right_types):
raise ValueError(
'Length of on_left=%d not equal to lenght of on_right=%d' % (

This comment has been minimized.

@cpcloud

cpcloud Sep 4, 2015

Member

small typo lenght should be length

llllllllll added a commit that referenced this pull request Sep 4, 2015

@llllllllll llllllllll merged commit 1f10384 into blaze:master Sep 4, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@llllllllll llllllllll deleted the quantopian:join-option-types branch Sep 4, 2015

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment