[SPARK-9793] [MLlib] [PySpark] PySpark DenseVector, SparseVector implement __eq__ and __hash__ correctly #8166
Test build #40766 has finished for PR 8166 at commit
Jenkins, test this please.
Test build #40949 has finished for PR 8166 at commit
    while k2 < v2_size and v2_values[k2] == 0:
        k2 += 1

    if k1 >= v1_size or k2 >= v2_size:
nit: since k1 will be at most == v1_size due to the earlier while, checking for == here will suffice and is easier to read
ditto for k2
Actually, I think checking k1 >= v1_size is more robust than k1 == v1_size, and the Scala code also uses the former.
OK, that's fine with me
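The loop under review can be sketched end to end as follows. This is a minimal reconstruction from the fragment quoted above, not the PR's actual code: the function name `sparse_equals` and everything outside the quoted `while`/`if` lines are assumptions. It walks both (indices, values) pairs, skips explicitly stored zeros, and uses the `>=` bounds checks discussed in this thread.

```python
def sparse_equals(v1_indices, v1_values, v2_indices, v2_values):
    """Compare two vectors given as (indices, values), ignoring stored zeros.

    Hypothetical helper for illustration; only the zero-skipping loops and
    the bounds check mirror the snippet quoted in the review.
    """
    v1_size, v2_size = len(v1_values), len(v2_values)
    k1 = k2 = 0
    while True:
        # Skip entries that are stored explicitly but equal to zero.
        while k1 < v1_size and v1_values[k1] == 0:
            k1 += 1
        while k2 < v2_size and v2_values[k2] == 0:
            k2 += 1

        # Once either side is exhausted, the vectors are equal only if
        # both sides are exhausted.
        if k1 >= v1_size or k2 >= v2_size:
            return k1 >= v1_size and k2 >= v2_size

        if v1_indices[k1] != v2_indices[k2] or v1_values[k1] != v2_values[k2]:
            return False
        k1 += 1
        k2 += 1
```

A dense vector can be fed to the same routine by passing `list(range(n))` as its indices, which is how a dense/sparse comparison reduces to the sparse/sparse case.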
LGTM after docstring change
Test build #41666 has finished for PR 8166 at commit
    if len(self) != other.size:
        return false
    return Vectors.equals(list(xrange(len(self))), self.array, other.indices, other.values)
return NotImplemented
Should it return False?
@yanboliang Please update the PR to use the first 128 nonzero entries to compute the hash.
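One way to read this request, as a rough sketch only: hash the vector size plus the index and bit pattern of each nonzero entry, stopping after 128 nonzeros. The function name `vector_hash`, the helper `_bits`, and the 31-based mixing are assumptions made for illustration; only the 128-nonzero cap comes from the comment above.

```python
import struct

MAX_HASH_NNZ = 128  # cap taken from the review comment; the rest is assumed


def _bits(value):
    # Reinterpret a double as an unsigned 64-bit integer (hypothetical helper).
    return struct.unpack('Q', struct.pack('d', value))[0]


def vector_hash(size, indices, values, max_nnz=MAX_HASH_NNZ):
    """Hash a vector from its size and its first `max_nnz` nonzero entries."""
    result = 31 + size
    nnz = 0
    for i, v in zip(indices, values):
        if nnz >= max_nnz:
            break
        if v != 0:  # stored zeros must not affect the hash
            result = 31 * result + i
            bits = _bits(v)
            result = 31 * result + (bits ^ (bits >> 32))
            nnz += 1
    return result
```

Because zeros are skipped, a dense vector (indices `range(size)`) and a sparse vector holding the same nonzeros produce the same hash, which is what equality across the two representations requires.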
d63d54e to 3b8ac7a
Test build #42420 has finished for PR 8166 at commit
@@ -122,6 +123,15 @@ def _format_float_list(l):
    return [_format_float(x) for x in l]


def _double_to_long_bits(value):
We can make the code more readable:
    if isnan(value):
        value = float('nan')
    return struct.unpack('Q', struct.pack('d', value))[0]
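For reference, the suggestion above becomes runnable once the imports are added. The sketch below assumes `isnan` comes from the standard `math` module and drops the leading underscore from the function name; the body is the reviewer's suggestion unchanged. Replacing any NaN with `float('nan')` first means every NaN payload maps to one bit pattern.

```python
import struct
from math import isnan


def double_to_long_bits(value):
    """Return the IEEE 754 bit pattern of a double as an unsigned 64-bit int."""
    if isnan(value):
        # Canonicalize: all NaNs, whatever their payload, share one pattern.
        value = float('nan')
    # Pack as a native double, then reinterpret the same 8 bytes as uint64.
    return struct.unpack('Q', struct.pack('d', value))[0]
```

Note that `0.0` and `-0.0` still yield different bit patterns here, so a caller that wants them to hash alike has to skip zeros before calling it.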
Test build #42465 has finished for PR 8166 at commit
LGTM. Merged into master. @yanboliang
@mengxr OK, I opened SPARK-10615 to track the
The PySpark DenseVector and SparseVector __eq__ methods should use semantic equality, so that a DenseVector can be compared with a SparseVector. Implement the PySpark DenseVector and SparseVector __hash__ methods based on the first 16 entries. This allows PySpark Vector objects to be used in collections.
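To illustrate why `__eq__` and `__hash__` have to agree before objects can be used in sets and dicts, here is a toy sketch. `TinyVector` is a hypothetical stand-in written for this example and is not the PySpark `DenseVector` or `SparseVector`; it only borrows the idea of hashing a bounded prefix of entries (16, per the description above).

```python
class TinyVector:
    """Hypothetical stand-in for a vector type; not the PySpark classes."""

    def __init__(self, size, entries):
        self.size = size
        # Drop zeros so dense-style and sparse-style inputs normalize alike.
        self.entries = {i: v for i, v in entries.items() if v != 0}

    def __eq__(self, other):
        if not isinstance(other, TinyVector):
            return NotImplemented
        return self.size == other.size and self.entries == other.entries

    def __hash__(self):
        # Hash only a bounded prefix of the nonzero entries; equal vectors
        # share that prefix, so equal objects always hash equally.
        prefix = tuple(sorted(self.entries.items())[:16])
        return hash((self.size, prefix))


dense_like = TinyVector(3, {0: 1.0, 1: 0.0, 2: 3.0})
sparse_like = TinyVector(3, {0: 1.0, 2: 3.0})
unique = {dense_like, sparse_like}      # collapses to a single element
lookup = {dense_like: "a"}[sparse_like]  # found through the equal key
```

Bounding the hash to a prefix trades a few extra collisions for constant-time hashing on very long vectors; correctness only needs equal objects to produce equal hashes.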