Add support for combining hashes in vector columns to HashingTransformer #4828

Open
wants to merge 4 commits into
base: master
from


@yaeldekel
Member

yaeldekel commented Feb 12, 2020

I am making this change so that CountTargetEncodingEstimator (see PR #4514) can start using HashingEstimator instead of HashJoiningTransform (see this comment in the above PR).

Besides enabling CountTargetEncodingEstimator to use it, this fixes a bug when splitting datasets in the CV command and in the train-test/CV APIs. Currently, specifying a vector-typed stratification column throws an exception, because the RangeFilter applied after hashing cannot handle vector columns.

@yaeldekel yaeldekel requested a review from dotnet/mlnet-core as a code owner Feb 12, 2020
@yaeldekel yaeldekel force-pushed the yaeldekel:hashing branch from f4229bb to 249db91 Feb 12, 2020
yaeldekel added 2 commits Feb 12, 2020
…y of HashingEstimator.
@yaeldekel

Member Author

yaeldekel commented Feb 17, 2020

Question: some of the input types for hashing have missing values (float, double and key type). The hashing transformer maps missing values in the input to 0, the missing value of the key type. What should we do in the combine case?
I think the correct thing to do is to return 0 if any slot has a missing value (that way, the behavior of a length-1 vector is identical whether Combine is true or false), but I would like to know what other people think.
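
To make the proposed semantics concrete, here is a minimal sketch of "return 0 if any slot is missing". It is not the HashingTransformer implementation: the class name, the method names and the FNV-1a-style mixing step are hypothetical stand-ins, and it assumes float inputs (NaN as the missing value) and a bit count between 1 and 31.

```csharp
using System;

// Illustrative only: names and the mixing function are assumptions,
// not the hash that ML.NET actually uses.
public static class CombineHashSketch
{
    private const uint OffsetBasis = 2166136261u;
    private const uint Prime = 16777619u;

    // FNV-1a-style mixing of one 32-bit value into a running hash.
    private static uint Mix(uint hash, uint value)
    {
        unchecked
        {
            for (int shift = 0; shift < 32; shift += 8)
            {
                hash ^= (value >> shift) & 0xFF;
                hash *= Prime;
            }
            return hash;
        }
    }

    // Maps a raw hash into [1, 2^numberOfBits]; 0 stays reserved for "missing".
    private static uint ToKey(uint hash, int numberOfBits)
    {
        uint mask = (1u << numberOfBits) - 1;
        return (hash & mask) + 1;
    }

    // Non-combined case: hash a single slot. NaN (missing) maps to 0.
    public static uint HashSlot(float value, uint seed, int numberOfBits)
    {
        if (float.IsNaN(value))
            return 0;
        uint bits = unchecked((uint)BitConverter.SingleToInt32Bits(value));
        return ToKey(Mix(seed ^ OffsetBasis, bits), numberOfBits);
    }

    // Combined case: one hash for the whole vector.
    // If any slot is missing, the result is 0.
    public static uint HashCombined(float[] values, uint seed, int numberOfBits)
    {
        uint hash = seed ^ OffsetBasis;
        foreach (float value in values)
        {
            if (float.IsNaN(value))
                return 0;
            hash = Mix(hash, unchecked((uint)BitConverter.SingleToInt32Bits(value)));
        }
        return ToKey(hash, numberOfBits);
    }
}
```

With this shape, HashCombined on a length-1 vector returns the same key as HashSlot on its only element, and a missing value in either path yields 0, which is the consistency property argued for above.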

  else if (((ISeededEnvironment)env).Seed.HasValue)
-     columnOptions = new HashingEstimator.ColumnOptionsInternal(samplingKeyColumn, origStratCol, 30, (uint)((ISeededEnvironment)env).Seed.Value);
+     columnOptions = new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, (uint)((ISeededEnvironment)env).Seed.Value, combine: true);

@codemzs

codemzs Feb 21, 2020

Member

You can merge these conditions into one by defining var localSeed = seed ?? ((ISeededEnvironment)env).Seed;

if(localSeed.HasValue)
...
else
columnOptions = new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, combine: true);

@justinormont

justinormont Feb 21, 2020

Member

I don't think we want HashingEstimator listening to the global seed. The hashing is a different meaning of "seed"; this isn't a PRNG seed.

See #4752 (comment)

/cc @najeeb-kazmi

@@ -122,14 +128,15 @@ private static VersionInfo GetVersionInfo()
return new VersionInfo(
modelSignature: "HASHTRNS",
// verWrittenCur: 0x00010001, // Initial
-     verWrittenCur: 0x00010002, // Invert hash key values, hash fix
+     //verWrittenCur: 0x00010002, // Invert hash key values, hash fix

@codemzs

codemzs Feb 21, 2020

Member

Nit: add a space between // and verWrittenCur.
