New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ngram hashing to estimator #1811

Open
wants to merge 25 commits into
base: master
from

Conversation

Projects
None yet
6 participants
@Ivanidzo4ka
Member

Ivanidzo4ka commented Dec 3, 2018

Fixes #1632

" col=F11:One col={name=F12 src=One ngram=4 max=3 max=4 max=5} col={name=F13 src=One ngram=3 skips=2}",
" col=F21:Vec col={name=F22 src=Vec max=20 ngram=4}}",
"xf=WordBag{col={name=F23 src=Vec max=10 ngram=3 skips=2}}",
"xf=ChooseColumns{col=Label col=F21 col=F22 col=F23 col=F11 col=F12 col=F13}",

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Dec 3, 2018

Member

ChooseColumns [](start = 24, length = 13)

SelectColumns everywhere #Closed

Ivanidzo4ka added some commits Dec 3, 2018

[assembly: LoadableClass(NgramHashingTransformer.Summary, typeof(NgramHashingTransformer), null, typeof(SignatureLoadModel),
"Ngram Hash Transform", NgramHashingTransformer.LoaderSignature)]
[assembly: LoadableClass(typeof(IRowMapper), typeof(NgramHashingTransformer), null, typeof(SignatureLoadRowMapper),
"Ngram Hash Transform", NgramHashingTransformer.LoaderSignature)]
namespace Microsoft.ML.Transforms.Text

This comment has been minimized.

@singlis

singlis Dec 3, 2018

Member

Should this stay in Microsoft.ML.Transforms.Text? Or should this move up one level to Microsoft.ML.Transforms? #Resolved

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Dec 4, 2018

Member

I prefer to keep it as is. All text transforms live in this namespace.


In reply to: 238483123 [](ancestors = 238483123)

private readonly VBuffer<ReadOnlyMemory<char>>[] _slotNames;
private readonly ColumnType[] _slotNamesTypes;
private const string RegistrationName = "NgramHash";
public NgramHashingTransformer(IHostEnvironment env, params ColumnInfo[] columns) :

This comment has been minimized.

@singlis

singlis Dec 3, 2018

Member

Comments. #Resolved

using Conditional = System.Diagnostics.ConditionalAttribute;
public sealed class NgramHashingTransformer : RowToRowMapperTransformBase
public sealed class NgramHashingTransformer : RowToRowTransformerBase

This comment has been minimized.

@singlis

singlis Dec 3, 2018

Member

NgramHashingTransformer [](start = 24, length = 23)

Comments #Resolved

InitColumnTypes();
if (Utils.Size(invertIinfos) > 0)
if (Utils.Size(columnWithInvertHash) > 0)

This comment has been minimized.

@singlis

singlis Dec 4, 2018

Member

Can you comment what this is doing? Also would it help to move to a Train function to separate this logic out from the constructor? #Resolved

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

If possible I think it would make it clearer with a separate training function.


In reply to: 238487172 [](ancestors = 238487172)

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Dec 4, 2018

Member

I sprinkle code with some comments, maybe they would be useful.


In reply to: 238786759 [](ancestors = 238786759,238487172)

TextModelHelper.SaveAll(Host, ctx, _exes.Length, _slotNames);
// int amount of columns
// columns
for (int i = 0; i < columnsLenght; i++)

This comment has been minimized.

@singlis

singlis Dec 4, 2018

Member

columnsLenght [](start = 32, length = 13)

mispelling... #Resolved

@@ -217,7 +217,7 @@ public void NgramWorkout()
var est = new WordTokenizingEstimator(Env, "text", "text")
.Append(new ValueToKeyMappingEstimator(Env, "text", "terms"))
.Append(new NgramExtractingEstimator(Env, "terms", "ngrams"))
.Append(new NgramHashEstimator(Env, "terms", "ngramshash"));
.Append(new NgramHashingEstimator(Env, "terms", "ngramshash"));

This comment has been minimized.

@singlis

singlis Dec 4, 2018

Member

Is there any benefit to doing a workout test on just the ngram hash estimator? For example, will the invalid data get caught by another transform?

Also what about more tests around the estimator class itself? Things like Saving (old and new as it looks like the version was revved), checking the output schema passing in bad values on construction, etc.
#Resolved

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

Would be good to see that you can load the old format


In reply to: 238489293 [](ancestors = 238489293)

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Dec 4, 2018

Member

I can't load old format, I broke backward compatibility with this changes.


In reply to: 238835998 [](ancestors = 238835998,238489293)

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Dec 4, 2018

Member

To @singlis@microsoft.com we are doing workout test here already.


In reply to: 238839172 [](ancestors = 238839172,238835998,238489293)

This comment has been minimized.

@artidoro

artidoro Dec 10, 2018

Contributor

Is it fine to break backward compatibility?


In reply to: 238839333 [](ancestors = 238839333,238839172,238835998,238489293)

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Dec 10, 2018

Member

I don't have other options. We break that compatibility in some other transforms already.


In reply to: 240384866 [](ancestors = 240384866,238839333,238839172,238835998,238489293)

This comment has been minimized.

@artidoro

artidoro Dec 10, 2018

Contributor

Ok! Thanks for explaining.


In reply to: 240385493 [](ancestors = 240385493,240384866,238839333,238839172,238835998,238489293)

Ivanidzo4ka added some commits Dec 4, 2018

@Ivanidzo4ka Ivanidzo4ka requested review from sfilipi and artidoro Dec 4, 2018

private readonly VBuffer<ReadOnlyMemory<char>>[] _slotNames;
private readonly ColumnType[] _slotNamesTypes;
private const string RegistrationName = "NgramHash";
public NgramHashingTransformer(IHostEnvironment env, params ColumnInfo[] columns) :
base(Contracts.CheckRef(env, nameof(env)).Register(nameof(NgramHashingTransformer)))

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

From the constructor below it seems that this is a trainable transformation, why does it have a public constructor then? #Resolved

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

Still don't understand why you have a public constructor for a trainable transform?


In reply to: 238786341 [](ancestors = 238786341)

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Dec 4, 2018

Member

It's only trainable in case we need to invert hash.
In case if user wants only ngrams, we just hash them without any training phase.


In reply to: 238842839 [](ancestors = 238842839,238786341)

_exes[iinfo] = new ColInfoEx(ctx);
InitColumnTypes();
TextModelHelper.LoadAll(Host, ctx, _exes.Length, out _slotNames, out _slotNamesTypes);
// int amount of columns

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

amount [](start = 20, length = 6)

"number of" #Resolved

/// takes tokenized text as input while <see cref="WordHashBagEstimator"/> tokenizes text internally.
/// </summary>
/// <param name="env">The environment.</param>
/// <param name="inputColumn">The column containing text to compute bag of word vector.</param>

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

The column containing text to compute bag of word vector. [](start = 38, length = 57)

I would change the comment to something like:
Name of input column containing tokenized text. #Resolved

/// </summary>
/// <param name="env">The environment.</param>
/// <param name="inputColumn">The column containing text to compute bag of word vector.</param>
/// <param name="outputColumn">The column containing bag of word vector. Null means <paramref name="inputColumn"/> is replaced.</param>

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

The column containing bag of word vector [](start = 39, length = 40)

Name of output column, will contain the ngram vector. #Resolved

/// <param name="env">The environment.</param>
/// <param name="inputColumns">The columns containing text to compute bag of word vector.</param>
/// <param name="outputColumn">The column containing output tokens.</param>
/// <param name="hashBits">Number of bits to hash into. Must be between 1 and 30, inclusive.</param>

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

Should be same as above. #Resolved

/// </summary>
/// <param name="env">The environment.</param>
/// <param name="columns">Pairs of columns to compute bag of word vector.</param>
/// <param name="hashBits">Number of bits to hash into. Must be between 1 and 30, inclusive.</param>

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

Technically you can have many input columns and one output column. I am not sure that pairs is the right term. Something like mapping might work. Otherwise you could just explain what it does using something like "Defines the input and output columns for the transformation." #Resolved

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Dec 4, 2018

Member

Are you talking about hashBits or columns?


In reply to: 238801145 [](ancestors = 238801145)

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

columns


In reply to: 238839036 [](ancestors = 238839036,238801145)

/// <param name="allLengths">Whether to include all ngram lengths up to <paramref name="ngramLength"/> or only <paramref name="ngramLength"/>.</param>
/// <param name="seed">Hashing seed.</param>
/// <param name="ordered">Whether the position of each source column should be included in the hash (when there are multiple source columns).</param>
/// <param name="invertHash">Limit the number of keys used to generate the slot name to this many. 0 means no invert hashing, -1 means no limit.</param>

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

Limit the number of keys used to generate the slot name to this many. [](start = 37, length = 69)

I am not certain that I understand the relationship between number of keys used to generate the slot name and invert hashing, would be helpful to have some more details. #Resolved

/// takes tokenized text as input while <see cref="WordHashBagEstimator"/> tokenizes text internally.
/// </summary>
/// <param name="env">The environment.</param>
/// <param name="columns">Array of columns with information how to transform data.</param>

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

Array of columns with information how to transform data. [](start = 34, length = 56)

"Specifies the behavior of the transformation." #Resolved

return true;
}
internal const string ExpectedColumnType = "Expected vector of Key type, and Key is convertable to U4";

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

Misspelling #Resolved

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Dec 4, 2018

Member

where?


In reply to: 238807293 [](ancestors = 238807293)

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

Sorry, convertible instead of convertable


In reply to: 238813469 [](ancestors = 238813469,238807293)

@artidoro

This comment has been minimized.

Contributor

artidoro commented Dec 4, 2018

    /// <param name="outputColumn">The column containing bag of word vector. Null means <paramref name="inputColumn"/> is replaced.</param>

Same comments for the documentation in this file. #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/TextCatalog.cs:413 in bacce51. [](commit_id = bacce51, deletion_comment = False)

@@ -8,6 +8,8 @@
using Microsoft.ML.Runtime.Model;
using Microsoft.ML.Runtime.RunTests;
using Microsoft.ML.Runtime.Tools;
using Microsoft.ML.StaticPipe;
using Microsoft.ML.Transforms;

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

Extra line
#Resolved

/// <summary>
/// Public constructor corresponding to SignatureDataTransform.
/// <summary>

This comment has been minimized.

@artidoro

artidoro Dec 4, 2018

Contributor

adjust summary #Resolved

@sfilipi

This comment has been minimized.

Member

sfilipi commented Dec 4, 2018

using Microsoft.ML.Core.Data;

favor: can you rename this file to OneHotEncoding since both the transformer and the estimator live in here. #Closed


Refers to: src/Microsoft.ML.Transforms/OneHotEncodingTransformer.cs:5 in bacce51. [](commit_id = bacce51, deletion_comment = False)

@@ -1,33 +1,42 @@
// Licensed to the .NET Foundation under one or more agreements.
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.

This comment has been minimized.

@sfilipi

sfilipi Dec 4, 2018

Member

Please rename to NgramHashing, since both the estimator and the transformer are on the same file. #Closed

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Dec 4, 2018

Member

I will, but in last iteration of this PR.


In reply to: 238849479 [](ancestors = 238849479)

"Contained some null or empty items");
if (invertHash < -1)
throw Contracts.ExceptParam(nameof(invertHash), "Value too small, must be -1 or larger");
if (invertHash != 0 && hashBits >= 31)

This comment has been minimized.

@sfilipi

sfilipi Dec 4, 2018

Member

= 31 [](start = 48, length = 5)

not your doing, but this is everywhere as a plain number... #WontFix

@@ -88,16 +86,26 @@ public sealed class Arguments
public bool Bag = KeyToVectorMappingEstimator.Defaults.Bag;
}
public class ColumnInfo
/// <summary>
/// Describes how the transformer handles one column pair.

This comment has been minimized.

@wschin

wschin Dec 4, 2018

Member
Suggested change Beta
/// Describes how the transformer handles one column pair.
/// Describes how one column (name: Input) would be mapped to another column (name: Output) in the transformer.

Ivanidzo4ka added some commits Dec 5, 2018

bla
/// <summary>
/// Describes how the transformer handles one column pair.
/// </summary>
/// <param name="input">Name of input column.</param>

This comment has been minimized.

@justinormont

justinormont Dec 5, 2018

Member

Would it be useful for the user's understanding/ease-of-use to specify the acceptable input data type(s)? Given the name, I assume only Key type is acceptable.

/// Describes how the transformer handles one column pair.
/// </summary>
/// <param name="input">Name of input column.</param>
/// <param name="output">Name of the column resulting from the transformation of <paramref name="input"/>. Null means <paramref name="input"/> is replaced. </param>

This comment has been minimized.

@justinormont

justinormont Dec 5, 2018

Member

Extra trailing space #Resolved

@Ivanidzo4ka

This comment has been minimized.

Member

Ivanidzo4ka commented Dec 10, 2018

@artidoro @sfilipi can you guys take a look?

}
}
[BestFriend]
internal const string Summary = "Converts input values (words, numbers, etc.) to index in a dictionary.";

This comment has been minimized.

@artidoro

artidoro Dec 10, 2018

Contributor

If they are already internal what is the advantage of making them BestFriend? And why are you not making the TermManagerLoaderSignature BestFriend? There are also other files in this pr where you could add the attribute if it is needed. #WontFix

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Dec 12, 2018

Member

Can't build project without best friend attribute.


In reply to: 240377196 [](ancestors = 240377196)

/// <param name="output">Name of the column resulting from the transformation of <paramref name="input"/>. Null means <paramref name="input"/> is replaced.</param>
/// <param name="maxNumTerms">Maximum number of terms to keep per column when auto-training.</param>
/// <param name="sort">How items should be ordered when vectorized. By default, they will be in the order encountered.
/// If by value items are sorted according to their default comparison, for example, text sorting will be case sensitive (for example, 'A' then 'Z' then 'a').</param>

This comment has been minimized.

@artidoro

artidoro Dec 10, 2018

Contributor

by value [](start = 19, length = 8)

What does "by value items" mean?

This comment has been minimized.

@artidoro

artidoro Dec 10, 2018

Contributor

Here and in other docs.


In reply to: 240377704 [](ancestors = 240377704)

This comment has been minimized.

@artidoro

artidoro Dec 12, 2018

Contributor

I still don't understand this sentence


In reply to: 240385251 [](ancestors = 240385251,240377704)

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Dec 12, 2018

Member

Are you confused by "by" or in general value items?


In reply to: 241128087 [](ancestors = 241128087,240385251,240377704)

This comment has been minimized.

@artidoro

artidoro Dec 12, 2018

Contributor

So earlier you talk about the default setting I understand when that happens. I don't understand what the following refers to: "If by value items are stored according to their default comparison"


In reply to: 241130103 [](ancestors = 241130103,241128087,240385251,240377704)

This comment has been minimized.

@artidoro

artidoro Dec 12, 2018

Contributor

I don't know what a "by value item" is, so that makes the entire sentence hard to understand.


In reply to: 241132122 [](ancestors = 241132122,241130103,241128087,240385251,240377704)

host.CheckValue(ctx, nameof(ctx));
ctx.CheckAtModel(GetVersionInfo());
return new NgramHashingTransformer(host, ctx);
}

This comment has been minimized.

@artidoro

artidoro Dec 10, 2018

Contributor

I think you are repeating some checks here and in NgramHashingTransformer(IHostEnvironment env, ModelLoadContext ctx). Maybe just have one method, the constructor? #Resolved

Ivanidzo4ka added some commits Dec 11, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment