Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hiding of ColumnOptions #2959

Merged
merged 9 commits into from Mar 19, 2019

Conversation

@artidoro
Copy link
Contributor

artidoro commented Mar 14, 2019

First step towards solving issue #2884.

The objective of this PR is to internalize the code that uses ColumnOptions.
All the commits are logically ordered.

  1. Internalization of the extensions using ColumnOptions
  2. Internalization of the ColumnOptions living on the estimators
  3. Update the extensions to make all the arguments in the ColumnOptions available through the extensions
  4. Updated the samples when possible, or moved them to the test folder when they contained ColumnOptions
  5. Other small changes to make the code compile

@artidoro artidoro added the api label Mar 14, 2019

@artidoro artidoro added this to the 0319 milestone Mar 14, 2019

@artidoro artidoro self-assigned this Mar 14, 2019

@artidoro

This comment has been minimized.

Copy link
Contributor Author

artidoro commented Mar 14, 2019

I am only aware of one extension in which all arguments were not made available: Normalize. It used to take ColumnOptionsBase and it would have required the introduction of 5 new extensions. I am not sure if it's in our interest to do that. I would suggest to wait until we make Options available.

@@ -105,7 +105,6 @@ public void InspectPipelineSchema()

// Define a pipeline
var pipeline = mlContext.Transforms.Concatenate("Features", HousingRegression.Features)
.Append(mlContext.Transforms.Normalize())
.AppendCacheCheckpoint(mlContext)

This comment has been minimized.

@artidoro

artidoro Mar 14, 2019

Author Contributor

This was not doing anything, because no column was specified (if I specified the Feature column, it alters the test behavior).
Rogan was fine with me removing the line here and below. #Resolved

@artidoro artidoro requested a review from abgoswam Mar 14, 2019

@codecov

This comment has been minimized.

Copy link

codecov bot commented Mar 14, 2019

Codecov Report

Merging #2959 into master will decrease coverage by <.01%.
The diff coverage is 67%.

@@            Coverage Diff             @@
##           master    #2959      +/-   ##
==========================================
- Coverage   72.35%   72.34%   -0.01%     
==========================================
  Files         803      803              
  Lines      143296   143331      +35     
  Branches    16155    16170      +15     
==========================================
+ Hits       103675   103686      +11     
- Misses      35194    35224      +30     
+ Partials     4427     4421       -6
Flag Coverage Δ
#Debug 72.34% <67%> (-0.01%) ⬇️
#production 68.04% <46.77%> (-0.02%) ⬇️
#test 88.53% <100%> (ø) ⬆️
Impacted Files Coverage Δ
...ML.Transforms/Text/StopWordsRemovingTransformer.cs 85.78% <ø> (+0.15%) ⬆️
src/Microsoft.ML.Transforms/Text/WordTokenizing.cs 79.15% <ø> (ø) ⬆️
src/Microsoft.ML.Transforms/NormalizerCatalog.cs 71.42% <ø> (ø) ⬆️
...t.ML.Data/Transforms/ValueToKeyMappingEstimator.cs 88.67% <ø> (+1.88%) ⬆️
src/Microsoft.ML.ImageAnalytics/ImageResizer.cs 84.64% <ø> (ø) ⬆️
src/Microsoft.ML.Transforms/Text/NgramTransform.cs 87.42% <ø> (ø) ⬆️
...c/Microsoft.ML.Transforms/CountFeatureSelection.cs 92.92% <ø> (ø) ⬆️
...icrosoft.ML.Mkl.Components/MklComponentsCatalog.cs 65.71% <ø> (ø) ⬆️
src/Microsoft.ML.PCA/PCACatalog.cs 92.3% <ø> (ø) ⬆️
...rc/Microsoft.ML.Transforms/HashJoiningTransform.cs 85.75% <ø> (ø) ⬆️
... and 50 more
@@ -259,6 +337,31 @@ public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.Co
/// ]]></format>
/// </example>
public static ValueMappingEstimator MapValue(
this TransformsCatalog.ConversionTransforms catalog,
IDataView lookupMap, string keyColumnName, string valueColumnName, string outputColumnName, string inputColumnName = null)

This comment has been minimized.

@TomFinley

TomFinley Mar 15, 2019

Contributor

keyColumnName, string valueColumnName [](start = 40, length = 37)

Given that you are providing an actual real honest-to-god dataview as an argument, and not merely the promise for one, the keyColumnName and valueColumnName argument should almost certainly be a DataViewSchema.Column over that data view. #Resolved

This comment has been minimized.

@TomFinley

TomFinley Mar 15, 2019

Contributor

Note that it is fine for us to internally just pass along the name in our internal infrastructure, but our external infrastructure should, if we can, allow checking of column (which in this case we totally can, since we have the data view right here!).


In reply to: 265996171 [](ancestors = 265996171)

This comment has been minimized.

@artidoro

artidoro Mar 18, 2019

Author Contributor

As I mentioned earlier, this is not an API that I am building, it is what we had in our public surface so far. I am happy to improve it if possible however.

We currently already check that the specified columNames match existing columns in the IDataView. Could you explain why using a DataViewSchema.Column would make this simpler from a user stand point?


In reply to: 265996567 [](ancestors = 265996567,265996171)

This comment has been minimized.

@wschin

wschin Mar 19, 2019

Member

Given that a user already has an IDataView, how would they call this function? I guess

MapValue("Out", lookupMap, lookupMap.Schema["Key"], lookupMap.Schema["Value"], ...)

is safer than

MapValue("Out", lookupMap, "Key", "Value", ...)

In reply to: 266647551 [](ancestors = 266647551,265996567,265996171)

This comment has been minimized.

@TomFinley

TomFinley Mar 19, 2019

Contributor

Ultimately yes, we have when possible been trying to shift to that style of API (see, e.g., the work mostly @sfilipi and possibly some others have been doing around shifting the IDataView APIs themselves). #Resolved

This comment has been minimized.

@artidoro

artidoro Mar 19, 2019

Author Contributor

Ok sounds good, I am changing it to DataViewSchema.Column.


In reply to: 266695209 [](ancestors = 266695209)

/// ]]></format>
/// </example>
public static ImageLoadingEstimator LoadImages(this TransformsCatalog catalog, string imageFolder, string outputColumnName, string inputColumnName = null)
=> new ImageLoadingEstimator(CatalogUtils.GetEnvironment(catalog), imageFolder, new[] { (outputColumnName, inputColumnName ?? outputColumnName) });

This comment has been minimized.

@wschin

wschin Mar 15, 2019

Member

Suggested signature:

        public static ImageLoadingEstimator LoadImages(this TransformsCatalog catalog, outputColumnName, string inputColumnName = null, string imageFolder= ".")

``` #Resolved

This comment has been minimized.

@artidoro

artidoro Mar 18, 2019

Author Contributor

I changed the order of the arguments around, but I kept imageFolder as a required argument.


In reply to: 266130543 [](ancestors = 266130543)

@wschin

This comment has been minimized.

Copy link
Member

wschin commented Mar 15, 2019

namespace Microsoft.ML.Samples.Dynamic

Samples? #Resolved


Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Projection/VectorWhitenWithColumnOptions.cs:6 in 2e27634. [](commit_id = 2e27634, deletion_comment = False)

@wschin

This comment has been minimized.

Copy link
Member

wschin commented Mar 15, 2019

        // The transformed (projected) data.

This is a sample. Can't we use your new extension here and make it a sample again? Or an equvialent example already exists? #Resolved


Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Projection/VectorWhitenWithColumnOptions.cs:44 in 2e27634. [](commit_id = 2e27634, deletion_comment = False)

@artidoro

This comment has been minimized.

Copy link
Contributor Author

artidoro commented Mar 18, 2019

        // The transformed (projected) data.

So there is another example, but it's true that we don't use the multiculumn mapping in this example, just the advanced options which now are part of the extension. I added it back to the samples repo.


In reply to: 473427015 [](ancestors = 473427015)


Refers to: docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Projection/VectorWhitenWithColumnOptions.cs:44 in 2e27634. [](commit_id = 2e27634, deletion_comment = False)

artidoro added some commits Mar 19, 2019

"Postgraduate"
new KeyValuePair<string,string>("0-5yrs", "Undergraduate"),
new KeyValuePair<string,string>("6-11yrs", "Postgraduate"),
new KeyValuePair<string,string>("12+yrs", "Postgraduate")
};

This comment has been minimized.

@TomFinley

TomFinley Mar 19, 2019

Contributor

So you have this code here.

var educationKeyValuePairs = new List<KeyValuePair<string,string>>()
{
    new KeyValuePair<string,string>("0-5yrs", "Undergraduate"),
    new KeyValuePair<string,string>("6-11yrs", "Postgraduate"),
    new KeyValuePair<string,string>("12+yrs", "Postgraduate")
};

I mean, that works I guess, it is technically not incorrect, but the most obvious thing that comes to mind when I hear something that implements an IEnumerable<KeyValuePair<A, B>> is Dictionary<A, B>. So this could have been written like so.

var educationMap = new Dictionary<string, string>();
educationMap["0-5yrs"] = "Undergraduate";
educationMap["6-11yrs"] = "Postgraduate";
educationMap["12+yrs"] = "Postgraduate";

Then you just use educationMap as you use your educationKeyValuePairs.

Not saying you have to do it that way, but given that this is a sample and meant to be read, it might be nice if we kept "code oddity" to a minimum. (Though I don't insist on this, we can fix samples later. Only if you have time.) #Resolved

@TomFinley
Copy link
Contributor

TomFinley left a comment

Thank you @artidoro !

/// <value>0</value> does not retain any input values. <value>-1</value> retains all input values mapping to each hash.</param>
/// <param name="rehashUnigrams">Whether to rehash unigrams.</param>
public static NgramHashingEstimator ProduceHashedNgrams(this TransformsCatalog.TextTransforms catalog,
string outputColumnName,

This comment has been minimized.

@wschin

wschin Mar 19, 2019

Member

One nuclear question. Do we have tests for these new functions?

This comment has been minimized.

@artidoro

artidoro Mar 19, 2019

Author Contributor

You are right some new extensions are not being tested.
I opened an issue about it #3013.

@wschin

wschin approved these changes Mar 19, 2019

Copy link
Member

wschin left a comment

Approved with a major comment about test. Please open an issue if that can't be addressed.

@artidoro artidoro merged commit 8b1b14f into dotnet:master Mar 19, 2019

3 checks passed

MachineLearning-CI #20190319.10 succeeded
Details
MachineLearning-CodeCoverage #20190319.10 succeeded
Details
license/cla All CLA requirements met.
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.