New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

swapping the order or arguments on the constructors of the ConversionExtensionsCatalog. Internalizing the constructors. #2118

Open
wants to merge 6 commits into
base: master
from

Conversation

3 participants
@sfilipi
Copy link
Member

sfilipi commented Jan 11, 2019

Working towards resolving #2064 by swapping the order or arguments on the constructors of the ConversionExtensions Catalog. Internalizing the constructors.

@sfilipi sfilipi requested review from Ivanidzo4ka , TomFinley and glebuk Jan 11, 2019

@sfilipi sfilipi added the api label Jan 11, 2019

@sfilipi sfilipi added this to In progress in Project 13 via automation Jan 11, 2019

@sfilipi sfilipi added this to the 0119 milestone Jan 11, 2019

@sfilipi sfilipi self-assigned this Jan 11, 2019

@@ -39,7 +39,7 @@ public void TensorFlowTransforCifarEndToEndTest()
.Append(new ImagePixelExtractingEstimator(mlContext, "ImageCropped", "Input", interleave: true))
.Append(new TensorFlowEstimator(mlContext, model_location, new[] { "Input" }, new[] { "Output" }))
.Append(new ColumnConcatenatingEstimator(mlContext, "Features", "Output"))
.Append(new ValueToKeyMappingEstimator(mlContext, "Label"))
.Append(new ValueToKeyMappingEstimator(mlContext, "Label", "Label"))

This comment has been minimized.

@sfilipi

sfilipi Jan 11, 2019

Member

, "Label" [](start = 77, length = 9)

side effect of this change is that we cannot have the output column default to null, and therefore substitute it with the input column when not provided... #Resolved

This comment has been minimized.

@TomFinley

TomFinley Jan 11, 2019

Contributor

Let's reverse that. We should retain the behavior and attitude that existed before the fiasco began, of the "name" (what was misidentified as "output") always being specified, and in the case where there is a single "source" (what was misidentified as "input"), it is taken as being the same as the "name."


In reply to: 246970621 [](ancestors = 246970621)

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

This was marked as resolved, but I still see that we've retained this change.


In reply to: 246974415 [](ancestors = 246974415,246970621)

/// <param name="hashBits">Number of bits to hash into. Must be between 1 and 31, inclusive.</param>
/// <param name="invertHash">During hashing we constuct mappings between original values and the produced hash values.
/// Text representation of original values are stored in the slot names of the metadata for the new column.Hashing, as such, can map many initial values to one.
/// <paramref name="invertHash"/> specifies the upper bound of the number of distinct input values mapping to a hash that should be retained.
/// <value>0</value> does not retain any input values. <value>-1</value> retains all input values mapping to each hash.</param>
public HashingEstimator(IHostEnvironment env, string inputColumn, string outputColumn = null,
internal HashingEstimator(IHostEnvironment env, string inputColumn, string outputColumn,

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Jan 11, 2019

Member

internal [](start = 8, length = 8)

why estimator constructor becomes internal? #Resolved

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Jan 11, 2019

Member

and shouldn't you swap output and input?


In reply to: 246971081 [](ancestors = 246971081)

This comment has been minimized.

@sfilipi

sfilipi Jan 11, 2019

Member

Constructors become internal, so that users instantiate them only through the MlContext. See #2100


In reply to: 246971081 [](ancestors = 246971081)

@@ -92,24 +92,24 @@ public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.Co
/// <param name="outputColumn">The name of the output column.</param>
/// <param name="bag">Whether bagging is used for the conversion. </param>
public static KeyToVectorMappingEstimator MapKeyToVector(this TransformsCatalog.ConversionTransforms catalog,
string inputColumn, string outputColumn = null, bool bag = KeyToVectorMappingEstimator.Defaults.Bag)
=> new KeyToVectorMappingEstimator(CatalogUtils.GetEnvironment(catalog), inputColumn, outputColumn, bag);
string outputColumn, string inputColumn, bool bag = KeyToVectorMappingEstimator.Defaults.Bag)

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Jan 11, 2019

Member

inputColumn [](start = 40, length = 11)

I'm not a Tom, and I can't speak from his name, but I don't remember him objecting on idea of one of column be optional, and I thinks it's perfectly fine to have second column nullable and replace it with first one. #Resolved

This comment has been minimized.

@TomFinley

TomFinley Jan 11, 2019

Contributor

Yes, in fact that was our prior behavior.


In reply to: 246971850 [](ancestors = 246971850)

@Ivanidzo4ka

This comment has been minimized.

Copy link
Member

Ivanidzo4ka commented Jan 11, 2019

        => new ValueToKeyMappingEstimator(CatalogUtils.GetEnvironment(catalog), columns, file, termsColumn, loaderFactory);

I don't think this constructor and constructor for estimator which accepts term column and loader should exist. (you can just remove term column and loader)
First because we don't want to expose IDataLoader interface and IComponentFactory to user.
Second because we already have ValueMapping estimator which allow you provide custom mappings and just do mappings from file is subset of that functionality. #Pending


Refers to: src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs:127 in c6961d9. [](commit_id = c6961d9, deletion_comment = False)

: this(env, new KeyToVectorMappingTransformer(env, columns))
{
}

public KeyToVectorMappingEstimator(IHostEnvironment env, string inputColumn, string outputColumn = null, bool bag = Defaults.Bag)
: this(env, new KeyToVectorMappingTransformer(env, new KeyToVectorMappingTransformer.ColumnInfo(inputColumn, outputColumn ?? inputColumn, bag)))
internal KeyToVectorMappingEstimator(IHostEnvironment env, string outputColumn, string inputColumn, bool bag = Defaults.Bag)

This comment has been minimized.

@TomFinley

TomFinley Jan 11, 2019

Contributor

outputColumn, string inputColumn [](start = 74, length = 32)

outputColumn, string inputColumn [](start = 74, length = 32)

Thank you @sfilipi! It was in a note somewhat down on the issue, but in a reply to @@justinormont where we were talking about consistency with existing argument names, I also observed that the names "inputs" and "outputs" are incorrect (since the inputs to outputs to estimators/transforms are not columns, but IDataViews), and should retain something closer the existing names we have for this name and source. Indeed, I think the bad naming was part of the reason some people got so badly confused in the first place. #Resolved

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Jan 11, 2019

Member

But these ones are InputColumn and OutputColumn and estimator operate over columns in IDataView. What is wrong with them? And why name and source and not destination and source? name is such a broad term to use it.


In reply to: 246974093 [](ancestors = 246974093)

This comment has been minimized.

@sfilipi

sfilipi Jan 11, 2019

Member

I hope i am not annoying and further fueling the conversation, but sharing my personal opinion:

1- I agree that the order of arguments should be swapped, and have destination, than source.
2- I find the name "name" for the output column confusing too. Yes it does make sense in the scenario, but you have to blink about 4 times before it does so; just because, as Ivan says, it is such a generic thing.
The names InputColumn, and OutputColumn make a lot more sense to me too, since they are actually column names..

The end!


In reply to: 246974845 [](ancestors = 246974845,246974093)

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

The term name and source is a well established name in this codebase. The concept of the input and output for a transform is already being established as being the dataviews themselves. Transforms do not take columns as inputs and outputs, they take data views as inputs and outputs -- that's obvious from the interface specification. So: let's jsut agree, the dataviews are the inputs and outputs, full stop. In the case where a transform happens to be producing an additional column, we have parameterized the transform as producing a column with that "name." And when that parameterization also happens to pay attention to one column, we often call that a "source." Always.

It's interesting that you have a different interpretation, and I'd like to emphasize that I don't discount it, but it is distinct from how we've architected the old IDataTransform and I don't like the new interpretation for the reasons already discussed at length.

Going forward, I think this shows how dangerous it is to introduce a new interpretation without discussing that new interpretation with anyone, discussing its merits, etc. I'm sorry that was done, but not sorry that we're fixing it.


In reply to: 247204046 [](ancestors = 247204046,246974845,246974093)

This comment has been minimized.

@Ivanidzo4ka

Ivanidzo4ka Jan 15, 2019

Member

Transforms do not take columns as inputs and outputs, they take data views as inputs and outputs

Estimators during it construction doesn't take any dataview, they take columns which expected to be in dataview. We don't touch dataview till we call Fit. We do schema propagation for GetOutputSchema on SchemaShape.Column,
Only thing which take data views as inputs as far as I know is AppendDataViewRows transform.

The term name and source is a well established name in this codebase.

Internally, yes, mostly as legacy for command line. Legacy doesn't mean it's a great and useful for another task (Public Api).

let's jsut agree, the dataviews are the inputs and outputs, full stop.

I'm as user, don't see input columns and output columns in estimators as input and output dataviews. I think @sfilipi has the same opinion. I can't even come up with estimator/transformer which would produce multiple dataviews (maybe evaluator with metrics as dataviews)

So far we have two people who think name and source is not a great idea (especially name)
We do it not because we want to make you mad, we do it because we also care about project.

without discussing that new interpretation with anyone

Can we actually have discussion regarding name and source vs inputColumn and outputColumn ? I would love to hear other people opinion, don't you?

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

Estimators during it construction doesn't take any dataview, they take columns which expected to be in dataview. We don't touch dataview till we call Fit.

If you read what I wrote, you'll see I was explicitly talking about transformers, but estimators are similar. They don't take inputs and outputs as columns. They take inputs as dataviews (as you say), and they the output of that is a transformer. This, again, already covered in #2064.

We do schema propagation for GetOutputSchema on SchemaShape.Column,

You're mistaken. Check the IEstimator interface again.

SchemaShape GetOutputSchema(SchemaShape inputSchema);

As you see, again, inputs and outputs to this process are not columnar, they in fact are more global than this.

Internally, yes, mostly as legacy for command line. Legacy doesn't mean it's a great and useful for another task (Public Api).

Already covered in #2064... to summarize, you must justify inconsistencies. We should try to minimize inconsistencies. Just pointing out that sometimes we've judged it healthy that the two should be unaligned because they're in different settings is not a free pass to introduce whatever inconsistencies you want. Not only was this change unjustified, it has the flaws that I have identified.

Can we actually have discussion regarding name and source vs inputColumn and outputColumn ? I would love to hear other people opinion, don't you?

At some point, as again established in #2064, one of my responsibilities (which, just to emphasize, I do not enjoy) is to judge when discussions cease to become productive, so that we can actually move on with the work that must be done. While I'd love to have the luxury of enough free time to just endlessly circle around things like this, we can't. Let's move on.


In reply to: 248024251 [](ancestors = 248024251)

public ValueToKeyMappingEstimator(IHostEnvironment env, string inputColumn, string outputColumn = null, int maxNumTerms = Defaults.MaxNumTerms, ValueToKeyMappingTransformer.SortOrder sort = Defaults.Sort) :
this(env, new [] { new ValueToKeyMappingTransformer.ColumnInfo(inputColumn, outputColumn ?? inputColumn, maxNumTerms, sort) })
internal ValueToKeyMappingEstimator(IHostEnvironment env, string outputColumn, string inputColumn, int maxNumTerms = Defaults.MaxNumTerms, ValueToKeyMappingTransformer.SortOrder sort = Defaults.Sort) :
this(env, new [] { new ValueToKeyMappingTransformer.ColumnInfo(inputColumn, outputColumn, maxNumTerms, sort) })

This comment has been minimized.

@TomFinley

TomFinley Jan 11, 2019

Contributor

inputColumn, outputColumn [](start = 74, length = 25)

It would be well to make sure we're consistent everywhere, and also swap these. #Resolved

@shauheen shauheen added this to In Progress in v0.10 Jan 11, 2019

@sfilipi

This comment has been minimized.

Copy link
Member

sfilipi commented Jan 13, 2019

    public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.ConversionTransforms catalog, params (string input, string output)[] columns)

I will fix this in the next PR here, because this goes all the way down to the OneToOneTransformerBase, and that change will require touching everything .. #Pending


Refers to: src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs:75 in c6961d9. [](commit_id = c6961d9, deletion_comment = False)

@sfilipi

This comment has been minimized.

Copy link
Member

sfilipi commented Jan 13, 2019

        => new ValueToKeyMappingEstimator(CatalogUtils.GetEnvironment(catalog), columns, file, termsColumn, loaderFactory);

Left the ctor, because it is used through the codebase; it;s internal. Removed the reference here.


In reply to: 453322670 [](ancestors = 453322670)


Refers to: src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs:127 in c6961d9. [](commit_id = c6961d9, deletion_comment = False)

@sfilipi sfilipi force-pushed the sfilipi:trasformerEstimatorInputOutputSwap branch from c6961d9 to 0064b05 Jan 13, 2019

/// <param name="input">Name of input column.</param>
/// <param name="output">Name of the column resulting from the transformation of <paramref name="input"/>. Null means <paramref name="input"/> is replaced.</param>
/// <param name="source">Name of column to work on.</param>
/// <param name="name">Name of the column resulting from the transformation of <paramref name="source"/>. Null means <paramref name="source"/> is replaced.</param>

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

Null means is replaced [](start = 118, length = 48)

Other way around. null on source means this. Also FYI, while I'm pleased the see that the parameters themselves are in the right order, the params tags themselves are not reversed. #Resolved

public readonly string Input;
public readonly string Output;
public readonly string Source;
public readonly string Name;
public readonly int HashBits;

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

While we're at it, might want to put these in the right order. #Resolved

public readonly DataKind OutputKind;
public readonly KeyRange OutputKeyRange;

/// <summary>
/// Describes how the transformer handles one column pair.
/// </summary>
/// <param name="input">Name of input column.</param>
/// <param name="output">Name of output column.</param>
/// <param name="source">Name of input column.</param>

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

Name of input column [](start = 37, length = 20)

Similar comment to other file about ordering of tags, description, etc. #Resolved

/// <param name="outputKind">The expected type of the converted column.</param>
/// <param name="outputKeyRange">New key range if we work with key type.</param>
public TypeConvertingTransformer(IHostEnvironment env, string inputColumn, string outputColumn, DataKind outputKind, KeyRange outputKeyRange = null)
: this(env, new ColumnInfo(inputColumn, outputColumn, outputKind, outputKeyRange))
public TypeConvertingTransformer(IHostEnvironment env, string source, string name, DataKind outputKind, KeyRange outputKeyRange = null)

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

string source, string name [](start = 63, length = 26)

This is not in proper order. #Pending

This comment has been minimized.

@sfilipi

sfilipi Jan 16, 2019

Member

I will tackle the transformers in the next pr, because i have to change all of them, they all derive from OneToOneTransformer.

Agree, it is confusing because I did partial work, on renaming the args.


In reply to: 248078355 [](ancestors = 248078355)

@@ -141,7 +130,7 @@ public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.Co
this TransformsCatalog.ConversionTransforms catalog,
IEnumerable<TInputType> keys,
IEnumerable<TOutputType> values,
params (string source, string name)[] columns)
params (string inputColumn, string outputColumn)[] columns)

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

inputColumn, string outputColumn [](start = 27, length = 32)

Remember, name, source, not source, name. And definitely not input, output. :) #Pending

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

Also I think for conistency sake we want the column mapping to be at the head, even when we have to forgo the convenience of the params.


In reply to: 248079402 [](ancestors = 248079402)

This comment has been minimized.

@sfilipi

sfilipi Jan 16, 2019

Member

I will do this on the next PR, because the tuple is passed directly from the estimator -> transfomer-> OneToOnetransformer base class.

I agree that it is confusing to review, because i did partial work with renaming the tuple items.


In reply to: 248079667 [](ancestors = 248079667,248079402)

/// <param name="hashBits">Number of bits to hash into. Must be between 1 and 31, inclusive.</param>
/// <param name="invertHash">During hashing we constuct mappings between original values and the produced hash values.
/// Text representation of original values are stored in the slot names of the metadata for the new column.Hashing, as such, can map many initial values to one.
/// <paramref name="invertHash"/> specifies the upper bound of the number of distinct input values mapping to a hash that should be retained.
/// <value>0</value> does not retain any input values. <value>-1</value> retains all input values mapping to each hash.</param>
public HashingEstimator(IHostEnvironment env, string inputColumn, string outputColumn = null,
internal HashingEstimator(IHostEnvironment env, string outputColumn, string inputColumn = null,

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

outputColumn [](start = 63, length = 12)

Another remnant of the bad input/output names. Also another case where the XML comments should be made less awkward, as noted elsewhere. #Resolved

: this(env, new KeyToVectorMappingTransformer(env, columns))
{
}

public KeyToVectorMappingEstimator(IHostEnvironment env, string inputColumn, string outputColumn = null, bool bag = Defaults.Bag)
: this(env, new KeyToVectorMappingTransformer(env, new KeyToVectorMappingTransformer.ColumnInfo(inputColumn, outputColumn ?? inputColumn, bag)))
internal KeyToVectorMappingEstimator(IHostEnvironment env, string outputColumn, string inputColumn = null, bool bag = Defaults.Bag)

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

outputColumn, string inputColumn [](start = 74, length = 32)

Another name, source... #Resolved

public TypeConvertingEstimator(IHostEnvironment env,
string inputColumn, string outputColumn = null,
internal TypeConvertingEstimator(IHostEnvironment env,
string outputColumn, string inputColumn = null,

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

outputColumn, string inputColumn [](start = 19, length = 32)

Here's another. I'm just going to stop commenting on these, maybe self-review, double check them? #Resolved

/// <param name="inputColumn">Name of the column to be transformed.</param>
/// <param name="outputColumn">Name of the output column. If this is null '<paramref name="inputColumn"/>' will be used.</param>
/// <param name="source">Name of the column to be transformed.</param>
/// <param name="name">Name of the column produced.</param>

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

/// Name of the column produced. [](start = 8, length = 59)

Reverse <param tags. #Resolved

/// <param name="inputColumn">Name of the column to be transformed.</param>
/// <param name="outputColumn">Name of the output column. If this is null '<paramref name="inputColumn"/>' will be used.</param>
/// <param name="source">Name of the column to be transformed.</param>
/// <param name="name">Name of the column produced.</param>

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

Name of the column produced [](start = 31, length = 27)

To my eyes, this phrasing is awkward. What was wrong with the existing descriptors we already used for this in existing tooling?

public abstract class SourceNameColumnBase
{
[Argument(ArgumentType.AtMostOnce, HelpText = "Name of the new column", ShortName = "name")]
public string Name;
[Argument(ArgumentType.AtMostOnce, HelpText = "Name of the source column", ShortName = "src")]
public string Source;

Maybe use that, with appropriate clarifications about what happens when source is null, and so on. #Resolved

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

This BTW is a general comment, I see lots of inconsistency in our documentation. Including descriptions of these columns as somehow being "inputs" and "outputs" which, as we've seen direct examples of, leads people to think about these structures in the wrong way.


In reply to: 248087222 [](ancestors = 248087222)

@@ -43,7 +43,7 @@ public void CV_Multiclass_Digits_RffTransform_OVAAveragedPerceptron()
var pipeline = mlContext.Transforms.Projection.CreateRandomFourierFeatures("Features", "FeaturesRFF")
.AppendCacheCheckpoint(mlContext)
.Append(mlContext.Transforms.Concatenate("Features", "FeaturesRFF"))
.Append(new ValueToKeyMappingEstimator(mlContext, "Label"))
.Append(new ValueToKeyMappingEstimator(mlContext, "Label", "Label"))

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

, "Label" [](start = 69, length = 9)

Now that the source column has a default value again, from what I see, we can get rid of this, right? #Resolved

@@ -27,7 +27,7 @@ public void Metacomponents()
var sdcaTrainer = ml.BinaryClassification.Trainers.StochasticDualCoordinateAscent("Label", "Features", advancedSettings: (s) => { s.MaxIterations = 100; s.Shuffle = true; s.NumThreads = 1; });

var pipeline = new ColumnConcatenatingEstimator (ml, "Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
.Append(new ValueToKeyMappingEstimator(ml, "Label"), TransformerScope.TrainTest)
.Append(new ValueToKeyMappingEstimator(ml, "Label", "Label"), TransformerScope.TrainTest)

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

, "Label" [](start = 66, length = 9)

Similar here. Unneeded, or should be. #Resolved

@@ -39,7 +39,7 @@ void Extensibility()
};
var pipeline = new ColumnConcatenatingEstimator (ml, "Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
.Append(new CustomMappingEstimator<IrisData, IrisData>(ml, action, null), TransformerScope.TrainTest)
.Append(new ValueToKeyMappingEstimator(ml, "Label"), TransformerScope.TrainTest)
.Append(new ValueToKeyMappingEstimator(ml, "Label", "Label"), TransformerScope.TrainTest)

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

, "Label" [](start = 66, length = 9)

Same, unneeded. #Resolved

@@ -30,7 +30,7 @@ void DecomposableTrainAndPredict()
var data = ml.Data.ReadFromTextFile<IrisData>(dataPath, separatorChar: ',');

var pipeline = new ColumnConcatenatingEstimator (ml, "Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
.Append(new ValueToKeyMappingEstimator(ml, "Label"), TransformerScope.TrainTest)
.Append(new ValueToKeyMappingEstimator(ml, "Label", "Label"), TransformerScope.TrainTest)

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

, "Label" [](start = 66, length = 9)

Same, unneeded. #Resolved

@TomFinley

This comment has been minimized.

Copy link
Contributor

TomFinley commented Jan 15, 2019

        public ColumnInfo(string input, string output,

This one was missed. #Pending


Refers to: src/Microsoft.ML.Transforms/OneHotHashEncoding.cs:230 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@TomFinley

This comment has been minimized.

Copy link
Contributor

TomFinley commented Jan 15, 2019

            var binaryCols = new List<(string input, string output)>();

While less essential since this is a field in an method that is not even privately visible, we've seen the mischief caused by this conflation of "input" and "output" with things that are neither inputs nor otuputs, and the ordering has all sorts of confusing far reaching consequences by confusing some people. Consider making this consistent. #Pending


Refers to: src/Microsoft.ML.Transforms/OneHotHashEncoding.cs:273 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@TomFinley

This comment has been minimized.

Copy link
Contributor

TomFinley commented Jan 15, 2019

            var binaryCols = new List<(string input, string output)>();

Obviously far less essential than getting the public surface area right.


In reply to: 454578490 [](ancestors = 454578490)


Refers to: src/Microsoft.ML.Transforms/OneHotHashEncoding.cs:273 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@TomFinley

This comment has been minimized.

Copy link
Contributor

TomFinley commented Jan 15, 2019

        string outputColumn = null, OneHotEncodingTransformer.OutputKind outputKind = Defaults.OutKind)

Another one missed. #Pending


Refers to: src/Microsoft.ML.Transforms/OneHotEncoding.cs:218 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@TomFinley

This comment has been minimized.

Copy link
Contributor

TomFinley commented Jan 15, 2019

            IReadOnlyDictionary<PipelineColumn, string> outputNames,

Hmmm. I wonder if I indirectly bear some fault for the confusion, since in my static API, I do consider the operations to be on columns, as opposed to IEstimator/ITransformer which considers operations to be on dataviews/schemas and whatnot. Not something to change, just something that amused me. #Closed


Refers to: src/Microsoft.ML.StaticPipe/TransformsStatic.cs:578 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

public readonly string Input;
public readonly string Output;
public readonly string Source;
public readonly string Name;

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

Name first. #Resolved

@@ -114,10 +114,10 @@ public ColumnInfo(string input, string output = null, bool bag = KeyToVectorMapp
public IReadOnlyCollection<ColumnInfo> Columns => _columns.AsReadOnly();
private readonly ColumnInfo[] _columns;

private static (string input, string output)[] GetColumnPairs(ColumnInfo[] columns)
private static (string source, string name)[] GetColumnPairs(ColumnInfo[] columns)

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

source, string name [](start = 31, length = 19)

Probably best to make consistent as we're trying to do elsewhere, even if it is private. #Resolved

This comment has been minimized.

@sfilipi

sfilipi Jan 16, 2019

Member

I will address it in my next PR, because they get used in the Transformer, which passes it to the base class OneToOnetransformer. Once that changes everything needs to change.


In reply to: 248091617 [](ancestors = 248091617)

@@ -31,16 +31,16 @@ public static void KeyToValue_Term()
// making use of default settings.
string defaultColumnName = "DefaultKeys";
// REVIEW create through the catalog extension
var default_pipeline = new WordTokenizingEstimator(ml, "Review")
.Append(new ValueToKeyMappingEstimator(ml, "Review", defaultColumnName));
var default_pipeline = mlContext.Transforms.Text.TokenizeWords("Review")

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

default_pipeline [](start = 16, length = 16)

Ugh. Our samples have been written with Python variable naming conventions, not C# ones? I'll have to file an issue about this... #2155. #WontFix

@@ -76,7 +76,7 @@ public void MetacomponentsFeaturesRenamed()

var sdcaTrainer = new SdcaBinaryTrainer(Env, "Label", "Vars", advancedSettings: (s) => { s.MaxIterations = 100; s.Shuffle = true; s.NumThreads = 1; });
var pipeline = new ColumnConcatenatingEstimator(Env, "Vars", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
.Append(new ValueToKeyMappingEstimator(Env, "Label"), TransformerScope.TrainTest)
.Append(new ValueToKeyMappingEstimator(Env, "Label", "Label"), TransformerScope.TrainTest)

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

, "Label" [](start = 67, length = 9)

Another unnecessary. #Resolved

@@ -193,7 +193,7 @@ private TextLoader.Arguments GetIrisLoaderArgs()
}
}).Read(GetDataPath(IrisDataPath));

var pipeline = new ValueToKeyMappingEstimator(Env, "Label");
var pipeline = new ValueToKeyMappingEstimator(Env, "Label", "Label");

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

, "Label" [](start = 70, length = 9)

Another unnecessary. #Resolved

@@ -249,7 +249,7 @@ public void NgramWorkout()
.Read(sentimentDataPath);

var est = new WordTokenizingEstimator(Env, "text", "text")
.Append(new ValueToKeyMappingEstimator(Env, "text", "terms"))
.Append(new ValueToKeyMappingEstimator(Env, "terms", "text"))
.Append(new NgramExtractingEstimator(Env, "terms", "ngrams"))

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

NgramExtractingEstimator [](start = 28, length = 24)

Hmmm. This didn't get fixed? #Pending

This comment has been minimized.

@sfilipi

sfilipi Jan 16, 2019

Member

I am not converting the NGramExtracting Estimator on this PR. It will come in the next PR.


In reply to: 248095484 [](ancestors = 248095484)

@@ -249,7 +249,7 @@ public void NgramWorkout()
.Read(sentimentDataPath);

var est = new WordTokenizingEstimator(Env, "text", "text")
.Append(new ValueToKeyMappingEstimator(Env, "text", "terms"))
.Append(new ValueToKeyMappingEstimator(Env, "terms", "text"))
.Append(new NgramExtractingEstimator(Env, "terms", "ngrams"))
.Append(new NgramHashingEstimator(Env, "terms", "ngramshash"));

This comment has been minimized.

@TomFinley

TomFinley Jan 15, 2019

Contributor

NgramHashingEstimator [](start = 28, length = 21)

Nor this? #Pending

This comment has been minimized.

@sfilipi

sfilipi Jan 16, 2019

Member

I am not converting the NGramHashing Estimator on this PR. It will come in the next PR.


In reply to: 248095509 [](ancestors = 248095509)

@TomFinley

This comment has been minimized.

Copy link
Contributor

TomFinley commented Jan 15, 2019

        public ColumnInfo(string input, string output = null,

Another. #Pending


Refers to: src/Microsoft.ML.Transforms/OneHotEncoding.cs:192 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@sfilipi

This comment has been minimized.

Copy link
Member

sfilipi commented Jan 16, 2019

        public ColumnInfo(string input, string output = null,

I did write a comment that i am not swapping for the Transformers on this PR. the column info is shared between the estimators and the transformers.

The reason I am not swapping for the transformers, is because they all pass the columns to base - OneToOneTransformer; and if you swap them in there, you have to do all the transforms. This was meant to be a small PR to just test understanding of the changes, reach agreement. I am tackling all of them in the next PR.


In reply to: 454584468 [](ancestors = 454584468)


Refers to: src/Microsoft.ML.Transforms/OneHotEncoding.cs:192 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@sfilipi

This comment has been minimized.

Copy link
Member

sfilipi commented Jan 16, 2019

        string outputColumn = null, OneHotEncodingTransformer.OutputKind outputKind = Defaults.OutKind)

I am not doing all the estimators in this PR; just the one for the ConversionsExtensionsCatalog. The reason this file is here, is because of the changes in KeyToVector.


In reply to: 454578807 [](ancestors = 454578807)


Refers to: src/Microsoft.ML.Transforms/OneHotEncoding.cs:218 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@sfilipi

This comment has been minimized.

Copy link
Member

sfilipi commented Jan 16, 2019

        public ColumnInfo(string input, string output,

I am not doing the OneHotHash in the PR. The changes below are for Hashing.


In reply to: 454577719 [](ancestors = 454577719)


Refers to: src/Microsoft.ML.Transforms/OneHotHashEncoding.cs:230 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@sfilipi

This comment has been minimized.

Copy link
Member

sfilipi commented Jan 16, 2019

            var binaryCols = new List<(string input, string output)>();

I will do it in the next PR, where the order gets swapped on the OneHotEncodingTransformer.

This PR has not addressed that component.


In reply to: 454578613 [](ancestors = 454578613,454578490)


Refers to: src/Microsoft.ML.Transforms/OneHotHashEncoding.cs:273 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@sfilipi sfilipi force-pushed the sfilipi:trasformerEstimatorInputOutputSwap branch from 520cb79 to 60ab6cd Jan 16, 2019

@sfilipi

This comment has been minimized.

Copy link
Member

sfilipi commented Jan 18, 2019

@glebuk, @TomFinley, @Ivanidzo4ka would you have time for a second look? Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment