
[jvm-packages] enable deterministic repartitioning when checkpoint is enabled #4807

Open
wants to merge 4 commits into
base: master
from

Conversation

@CodingCat
Member

commented Aug 25, 2019

part of #4786

@CodingCat

Member Author

commented Aug 25, 2019

@trams
Contributor

left a comment

LGTM in general.
My only concern is:

Why do we need to add a feature value to our hash function at all?
The hashCode of a Vector is a murmur hash, which is based purely on the values of the vector.

If adding a value from the vector makes the hashing better, why doesn't Murmur do this itself? Is it because Murmur is essentially streaming?

If this method does not make the hashing "better", is there some other reason I am missing to do this?

Also, why can't we just use the hash value with a HashPartitioner? That should simplify the logic.

@CodingCat CodingCat force-pushed the CodingCat:deterministic_partitioning_2 branch from a483154 to 155a30d Sep 4, 2019

case sparseVector: SparseVector =>
featureValueOfSparseVector(rowHashCode, sparseVector)
}
math.abs((rowHashCode.toLong + featureValue).toString.hashCode % numPartitions)
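For reference, the scheme in this diff can be sketched as a self-contained function (the wrapper name `deterministicPartition` is assumed here, not from the PR; in the PR, `featureValue` is extracted from the row's dense or sparse vector):

```scala
// Sketch of the hashing scheme in the diff above: add a feature value to
// the row hash, stringify the resulting Double (so Long/Double overflow
// cannot occur), then hash the string before taking the modulus.
// `deterministicPartition` is an assumed name, not from the PR.
object DeterministicPartitionSketch {
  def deterministicPartition(rowHashCode: Int, featureValue: Double, numPartitions: Int): Int =
    math.abs((rowHashCode.toLong + featureValue).toString.hashCode % numPartitions)

  def main(args: Array[String]): Unit = {
    // Deterministic: the same inputs always map to the same partition.
    println(deterministicPartition(42, 1.5, 3) == deterministicPartition(42, 1.5, 3))  // true
  }
}
```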

@CodingCat

CodingCat Sep 4, 2019

Author Member

@trams please check this

this is now Long + Double, then toString, then hashCode

it should prevent overflow

@trams

trams Sep 12, 2019

Contributor
  1. Why can't we just do
    math.abs((rowHashCode.toLong + featureValue) % numPartitions.toLong).toInt

given that numPartitions is small (< MAX_INT), the final .toInt won't actually truncate any useful information out of the Long

  2. Your way definitely avoids overflows, but I feel like .toString.hashCode may be overkill

@CodingCat

CodingCat Sep 13, 2019

Author Member

Long + Double is a Double; taking that Double mod an integer and then converting to Int creates many conflicts in the partition number

e.g. any value in [1.0, 2.0) will be assigned to partition 1 given, e.g., 3 partitions....

(actually rowHashCode.toDouble + featureValue would read better)
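A quick illustration of the collision (hypothetical numbers; the helper name `proposedPartition` is assumed, not from the PR):

```scala
// With the proposed formula, the final .toInt discards the fractional
// part of the Double, so for a fixed rowHashCode every featureValue in
// [1.0, 2.0) lands on the same partition.
object ModCollisionDemo {
  def proposedPartition(rowHashCode: Int, featureValue: Double, numPartitions: Int): Int =
    math.abs((rowHashCode.toLong + featureValue) % numPartitions.toLong).toInt

  def main(args: Array[String]): Unit = {
    // Fix rowHashCode = 0 and 3 partitions; vary only the feature value.
    val parts = Seq(1.0, 1.1, 1.5, 1.999).map(fv => proposedPartition(0, fv, 3))
    println(parts)  // List(1, 1, 1, 1): the fraction contributes nothing
  }
}
```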

@trams

trams Sep 16, 2019

Contributor

I see your point, but I do not think that every feature value in [1.0, 2.0) will be assigned to partition 1, given that those feature values should produce different row hash codes.

I do not mind toString if it is better, but I would prefer to do without it.

What about using rowHashCode.toLong + featureValue.toLong?

P.S. I really do not have experience building hash functions, and I am always paranoid about accidentally picking a bad one, so I am fine with toString. I just want to make sure we have looked at all the options.

@trams

trams Sep 16, 2019

Contributor

Also, what do you think about a completely different option: use murmur to combine rowHashCode and featureValue into a good hash value?

P.S. I am just spitballing here, so the idea may be completely stupid.

@CodingCat

CodingCat Sep 16, 2019

Author Member

AFK, so just a quick response on why [1, 2) will be assigned to the same partition.

Your proposal is
math.abs((rowHashCode.toLong + featureValue) % numPartitions.toLong).toInt

Because of the toInt here, 1.1.toInt will be 1, and 1.6.toInt will be the same.

@trams

trams Sep 16, 2019

Contributor

I understand that 1.1.toInt = 1.6.toInt = 1, but it is not obvious to me that applying .toInt to featureValue is a problem. Consider:

  1. rowHashCode.toLong = x, featureValue = 1.1: featureValue.toInt = 1, so the result is (x + 1) % numPartitions
  2. rowHashCode.toLong = y, featureValue = 1.6: featureValue.toInt = 1, so the result is (y + 1) % numPartitions

Given that x != y (because featureValue is part of the Row and differs), it should (key word: should) not be a problem.

But I am pretty sure I am missing something else here. I do not want to bikeshed or prematurely optimize (my concern with toString is mostly about performance and the excessive allocation of Strings), because I think this change is awesome and will solve a lot of problems.

All this discussion made me think: do we have any tests on "real" data which show that this hash function is "uniform" to some extent?
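On that last question, a minimal uniformity smoke test could look like the sketch below — it uses synthetic data rather than "real" data, and both helper names (`deterministicPartition`, `partitionCounts`) are assumptions, not code from the PR:

```scala
import scala.util.Random

// Bucket 100k synthetic (rowHashCode, featureValue) pairs using the
// toString.hashCode scheme from the diff and report the partition counts,
// so skew is easy to eyeball. Synthetic inputs only; a real test would
// replay production rows.
object UniformitySmokeTest {
  def deterministicPartition(rowHashCode: Int, featureValue: Double, numPartitions: Int): Int =
    math.abs((rowHashCode.toLong + featureValue).toString.hashCode % numPartitions)

  def partitionCounts(n: Int, numPartitions: Int, seed: Long): Array[Int] = {
    val rng = new Random(seed)
    val counts = new Array[Int](numPartitions)
    for (_ <- 0 until n)
      counts(deterministicPartition(rng.nextInt(), rng.nextDouble() * 100, numPartitions)) += 1
    counts
  }

  def main(args: Array[String]): Unit = {
    // With 100k samples over 8 partitions, counts near 12500 each suggest
    // the function spreads reasonably; large deviations would flag skew.
    println(partitionCounts(100000, 8, 42L).mkString(", "))
  }
}
```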

@CodingCat

Member Author

commented Sep 9, 2019

@trivialfis any idea on the flaky gpu test?

@trivialfis

Member

commented Sep 9, 2019

Looks like a network failure ...

@hcho3 hcho3 referenced this pull request Sep 16, 2019