
Monotone constraint support for LightGBM #2330

Closed. Wants to merge 12 commits.
10 changes: 10 additions & 0 deletions src/Microsoft.ML.LightGBM/LightGbmArguments.cs
@@ -371,6 +371,15 @@ public enum EvalMetricType
[TlcModule.SweepableDiscreteParam("CatL2", new object[] { 0.1, 0.5, 1, 5, 10 })]
public double CatL2 = 10;

[Argument(ArgumentType.Multiple,
@Ivanidzo4ka (Contributor), Feb 1, 2019, commenting on "Argument":

Please add a ShortName; if someone ever needs to use this on the command line, they will curse you for typing these words again and again. #Closed

HelpText = "Sets the constraints for monotonic features. This is a 0 based index for each feature in " +
"the features column. A keyword of 'pos' for positive constraint or 'neg' for negative constraint is " +
"specified followed by a range. For example, pos:0-2 neg:3,5 will apply a positive constraint to the " +
"first three features and a negative constraint to the 4th and 6th feature. If feature index is not specified, " +
"then no constraint will be applied. The keyword of 'pos' or 'neg' without a range will apply the constraint to all features.",
ShortName="mc")]
public string[] MonotoneConstraints;
@Ivanidzo4ka (Contributor), Feb 1, 2019, commenting on "public string[] MonotoneConstraints;":

A) Please unblock and run the RegenerateEntryPointCatalog test on your local machine, then push the changes to core_ep-list.tsv.
B) How would it handle a null value? I don't see a test covering that.
C) How would it handle gibberish values, like Bubba or TabbyCat? What kind of exception does it throw, and would that exception be enough to understand that something is wrong with this parameter? #Closed

Contributor, commenting on "MonotoneConstraints":

Sorry to bring this up on the 9th iteration, but I'm just curious: wouldn't it be easier to have two options, one for positive monotone constraints and the other for negative, both of type int[]?

Contributor:

Or an array of Range, or whatever class we use in TextLoader to specify a range of columns.


In reply to: 254096866

@justinormont (Contributor), Feb 6, 2019:

Using the TextLoader's text-based ranges offers interesting abilities, like "~" and "*". For example setting "pos:3-*" so the user doesn't need to know the total number of slots.

The current method is interesting as it allows for setting everything positive, then some negative "pos neg:5-10". The gain here is the user doesn't need to know the total number of slots.

I don't see a shortcut to set positive on all except for a few: "pos neutral:5-10". If you know the total number of slots, you could do "pos:0-4,11-547542".
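The overlay semantics described above ("pos neg:5-10", with the last spec winning on overlap) can be sketched in Python. This is an illustration of the behavior the HelpText describes, not the PR's C# code; the function name is made up.

```python
def expand(specs, feature_count):
    """Expand specs like 'pos' or 'neg:5-10' into a per-feature vector of
    1 (increasing), -1 (decreasing), or 0 (unconstrained). Later specs
    overwrite earlier ones on overlapping indices."""
    out = [0] * feature_count
    for spec in specs:
        keyword, _, ranges = spec.partition(":")
        sign = {"pos": 1, "neg": -1}[keyword]
        if not ranges:
            # Bare keyword: apply the constraint to every feature.
            out = [sign] * feature_count
            continue
        for r in ranges.split(","):
            lo, _, hi = r.partition("-")
            lo, hi = int(lo), (int(hi) if hi else int(lo))
            for i in range(lo, hi + 1):
                out[i] = sign
    return out

# "pos neg:5-10": everything positive, then indices 5..10 flipped negative.
print(expand(["pos", "neg:5-10"], 12))
# [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, 1]
```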

Member Author:

So initially I did not use the TextLoader Range, as it looked to contain properties specific to text loading. However, your comments resonated with me, so I have made some changes:

  1. I created a new Range class in Microsoft.ML.Core - this is meant to be a more generic Range class that can be used by other Arguments when a range is needed. Tests were added as well.
  2. I removed MonotonicConstraints as an argument and added MonotonicPositive and MonotonicNegative - these are now Range arrays.
  3. I updated tests and baseline files to reflect these changes.

Overall I like this better, as it follows the range format already used by the text loader. If you want to select all features, you can use the * and set the range to 0-*. The short names for these arguments are mp and mn.

Take a look - I'm curious what you think. There may be an opportunity to merge the Microsoft.ML.Core Range and the TextLoader Range, but I think that should be a different discussion.


In reply to: 254097097
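For the two-option Range-array shape described above (MonotonicPositive/MonotonicNegative), the expansion might look like the Python sketch below. The (min, max) tuple representation and the None-as-'*' convention are assumptions for illustration, not the PR's actual Range class.

```python
def to_constraint_vector(positive, negative, feature_count):
    """Build LightGBM's monotone_constraints vector from two lists of
    (min, max) index ranges. max=None stands in for '*', i.e. through
    the last feature. Negative ranges are applied after positive ones,
    so they win on overlap."""
    out = [0] * feature_count
    for ranges, sign in ((positive, 1), (negative, -1)):
        for lo, hi in ranges:
            hi = feature_count - 1 if hi is None else hi
            for i in range(lo, hi + 1):
                out[i] = sign
    return out

# mp=0-* mn=3: positive everywhere except feature 3.
print(to_constraint_vector([(0, None)], [(3, 3)], 5))
# [1, 1, 1, -1, 1]
```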

Contributor:

I didn't expect you to take me seriously on this proposal; I was just thinking out loud.
I'm fine with the changes, but I would like @TomFinley to give his opinion on the new Range class and how it fits into our API story.

@singlis (Member Author), Feb 7, 2019:

Thanks @Ivanidzo4ka. It looks like this PR fails the .netcore30 builds because there is now a System.Range, which causes compiler confusion about which Range is being referenced. So I am curious what @TomFinley has to say.


[Argument(ArgumentType.Multiple, HelpText = "Parallel LightGBM Learning Algorithm", ShortName = "parag")]
public ISupportParallel ParallelTrainer = new SingleTrainerFactory();

@@ -428,6 +437,7 @@ public enum EvalMetricType
res[GetArgName(nameof(MaxCatThreshold))] = MaxCatThreshold;
res[GetArgName(nameof(CatSmooth))] = CatSmooth;
res[GetArgName(nameof(CatL2))] = CatL2;
res[GetArgName(nameof(MonotoneConstraints))] = (object)MonotoneConstraints;
return res;
}
}
2 changes: 1 addition & 1 deletion src/Microsoft.ML.LightGBM/LightGbmMulticlassTrainer.cs
@@ -36,7 +36,7 @@ public sealed class LightGbmMulticlassTrainer : LightGbmTrainerBase<VBuffer<floa
public override PredictionKind PredictionKind => PredictionKind.MultiClassClassification;

internal LightGbmMulticlassTrainer(IHostEnvironment env, Options options)
- : base(env, LoadNameValue, options, TrainerUtils.MakeBoolScalarLabel(options.LabelColumn))
+ : base(env, LoadNameValue, options, TrainerUtils.MakeU4ScalarColumn(options.LabelColumn))
{
_numClass = -1;
}
116 changes: 114 additions & 2 deletions src/Microsoft.ML.LightGBM/LightGbmTrainerBase.cs
@@ -4,6 +4,7 @@

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.Core.Data;
using Microsoft.ML.Data;
using Microsoft.ML.EntryPoints;
@@ -89,6 +90,116 @@ private sealed class CategoricalMetaData
InitParallelTraining();
}

[BestFriend]
internal void ExpandMonotoneConstraint(ref Dictionary<string, object> options, int featureCount)
{
string monotoneArgumentName = "monotone_constraints";
if (!options.ContainsKey(monotoneArgumentName))
return;

// If the constraints argument is not a string array, then return
if (!(options[monotoneArgumentName] is string[]))
return;

string[] constraintArguments = (string[])options[monotoneArgumentName];
if (constraintArguments == null || constraintArguments.Length == 0)
{
options.Remove(monotoneArgumentName);
return;
}

// Convert the monotone constraints parameter into the form consumed by LightGBM.
// The format of a constraint is a keyword that specifies
// the constraint type: pos for a positive constraint (which maps to
// 1 for LightGBM) and neg for a negative constraint (which maps to -1
// for LightGBM).
// For example:
// pos:0-3,5
// neg:6-7
// Since the argument is a string array, multiple ranges can be specified.
// In the event of the same index being specified twice, the last one wins.
// The end result is a string array of 1s,0s, and -1s that should be equal
@justinormont (Contributor), Feb 5, 2019, suggested change (#Resolved):
- // The end result is a string array of 1s,0s, and -1s that should be equal
+ // The end result is a string array of 1s, 0s, and -1s that should be equal

// to the number of features in the Feature column.
int[] constraintArray = new int[featureCount];
const string positiveKeyword = "pos";
const string negativeKeyword = "neg";

foreach (var argument in constraintArguments)
{
// Split by : to get the keyword and range
var subArguments = argument.Split(':');
if (subArguments.Length > 2)
// Invalid argument (more than one ':'); fail with an exception
throw Contracts.Except(Host, $"Invalid argument {argument}");

var keyword = subArguments[0].ToLowerInvariant();
int constraint = 0;
if (keyword.Equals(positiveKeyword, StringComparison.OrdinalIgnoreCase))
constraint = 1;
else if (keyword.Equals(negativeKeyword, StringComparison.OrdinalIgnoreCase))
constraint = -1;
else
throw Contracts.Except(Host, $"Unsupported keyword {keyword}");

// If only the keyword (pos or neg) is present without a range, this will
// set the same constraint for all features.
if (subArguments.Length == 1)
{
for (int i = 0; i < featureCount; i++)
constraintArray[i] = constraint;
continue;
}

var rangesArgument = subArguments[1];

// Parse the range. Since multiple ranges
// can be specified in a single argument, split by comma first
var indexRanges = new List<(int min, int max)>();
int min;
int max;
var ranges = rangesArgument.Split(',');
if (ranges.Length == 0)
continue;

foreach (var range in ranges)
{
min = 0;
max = 0;

// Split by -
var minMax = range.Split('-');
if (minMax.Length > 2)
throw Contracts.Except(Host, $"Invalid range specified {range}");

if (minMax.Length == 1)
{
// single index
if (int.TryParse(minMax[0], out min) &&
min >= 0 && min < featureCount)
indexRanges.Add((min, min));
}
else
{
if (int.TryParse(minMax[0], out min) &&
int.TryParse(minMax[1], out max) &&
min < max &&
min >= 0 && min < featureCount &&
max > 0 && max < featureCount)

indexRanges.Add((min, max));
}
}

// Process each range
foreach (var indexRange in indexRanges)
for (int i = indexRange.min; i <= indexRange.max; ++i)
constraintArray[i] = constraint;
}
// Update Options to contain the expanded array
var optionString = string.Join(",", constraintArray);
options[monotoneArgumentName] = optionString;
}

private protected LightGbmTrainerBase(IHostEnvironment env, string name, Options options, SchemaShape.Column label)
: base(Contracts.CheckRef(env, nameof(env)).Register(name), TrainerUtils.MakeR4VecFeature(options.FeatureColumn), label, TrainerUtils.MakeR4ScalarWeightColumn(options.WeightColumn))
{
@@ -169,7 +280,7 @@ private protected virtual void CheckDataValid(IChannel ch, RoleMappedData data)
ch.CheckParam(data.Schema.Label.HasValue, nameof(data), "Need a label column");
}

- protected virtual void GetDefaultParameters(IChannel ch, int numRow, bool hasCategarical, int totalCats, bool hiddenMsg=false)
+ protected virtual void GetDefaultParameters(IChannel ch, int numRow, bool hasCategarical, int totalCats, bool hiddenMsg = false)
{
double learningRate = Args.LearningRate ?? DefaultLearningRate(numRow, hasCategarical, totalCats);
int numLeaves = Args.NumLeaves ?? DefaultNumLeaves(numRow, hasCategarical, totalCats);
@@ -318,6 +429,7 @@ private Dataset LoadTrainingData(IChannel ch, RoleMappedData trainData, out Cate
GetMetainfo(ch, factory, out int numRow, out float[] labels, out float[] weights, out int[] groups);
catMetaData = GetCategoricalMetaData(ch, trainData, numRow);
GetDefaultParameters(ch, numRow, catMetaData.CategoricalBoudaries != null, catMetaData.TotalCats);
ExpandMonotoneConstraint(ref Options, catMetaData.NumCol);

Dataset dtrain;
string param = LightGbmInterfaceUtils.JoinParameters(Options);
@@ -590,7 +702,7 @@ private void GetFeatureValueDense(IChannel ch, FloatLabelCursor cursor, Categori
int[] nonZeroCntPerColumn = new int[catMetaData.NumCol];
int estimateNonZeroCnt = (int)(numSampleRow * density);
estimateNonZeroCnt = Math.Max(1, estimateNonZeroCnt);
- for(int i = 0; i < catMetaData.NumCol; i++)
+ for (int i = 0; i < catMetaData.NumCol; i++)
{
nonZeroCntPerColumn[i] = 0;
sampleValuePerColumn[i] = new double[estimateNonZeroCnt];
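The validation behavior discussed in this review (invalid arguments and unsupported keywords throw; unparseable or out-of-range indices are silently dropped) can be exercised with a small Python port of ExpandMonotoneConstraint. This is an illustrative sketch of the C# logic in the diff above, not shipped code; it also relaxes the strict min < max guard so that a single-index range like "3-3" is accepted.

```python
def expand_monotone_constraint(arguments, feature_count):
    """Turn arguments like ['pos:0-2', 'neg:3,5'] into the comma-separated
    1/0/-1 string LightGBM expects for monotone_constraints.
    Raises ValueError where the C# code throws Contracts.Except."""
    out = [0] * feature_count
    signs = {"pos": 1, "neg": -1}
    for argument in arguments:
        parts = argument.split(":")
        if len(parts) > 2:
            raise ValueError(f"Invalid argument {argument}")
        keyword = parts[0].lower()
        if keyword not in signs:
            raise ValueError(f"Unsupported keyword {keyword}")
        sign = signs[keyword]
        if len(parts) == 1:
            # Bare 'pos' or 'neg': constrain every feature.
            out = [sign] * feature_count
            continue
        for rng in parts[1].split(","):
            bounds = rng.split("-")
            if len(bounds) > 2:
                raise ValueError(f"Invalid range specified {rng}")
            try:
                lo = int(bounds[0])
                hi = int(bounds[1]) if len(bounds) == 2 else lo
            except ValueError:
                continue  # unparseable index: silently skipped, like TryParse
            if 0 <= lo <= hi < feature_count:
                for i in range(lo, hi + 1):
                    out[i] = sign
    return ",".join(str(v) for v in out)

print(expand_monotone_constraint(["pos:0-2", "neg:3,5"], 6))
# 1,1,1,-1,0,-1
```

This makes it easy to answer question C above for this sketch: a keyword like "Bubba" raises ValueError with the offending keyword in the message.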
60 changes: 60 additions & 0 deletions test/BaselineOutput/Common/EntryPoints/core_manifest.json
@@ -11606,6 +11606,21 @@
]
}
},
{
"Name": "MonotoneConstraints",
"Type": {
"Kind": "Array",
"ItemType": "String"
},
"Desc": "Sets the constraints for monotonic features. This is a 0 based index for each feature in the features column. A keyword of 'pos' for positive constraint or 'neg' for negative constraint is specified followed by a range. For example, pos:0-2 neg:3,5 will apply a positive constraint to the first three features and a negative constraint to the 4th and 6th feature. If feature index is not specified, then no constraint will be applied. The keyword of 'pos' or 'neg' without a range will apply the constraint to all features.",
"Aliases": [
"mc"
],
"Required": false,
"SortOrder": 150.0,
"IsNullable": false,
"Default": null
},
{
"Name": "ParallelTrainer",
"Type": {
@@ -12101,6 +12116,21 @@
]
}
},
{
"Name": "MonotoneConstraints",
"Type": {
"Kind": "Array",
"ItemType": "String"
},
"Desc": "Sets the constraints for monotonic features. This is a 0 based index for each feature in the features column. A keyword of 'pos' for positive constraint or 'neg' for negative constraint is specified followed by a range. For example, pos:0-2 neg:3,5 will apply a positive constraint to the first three features and a negative constraint to the 4th and 6th feature. If feature index is not specified, then no constraint will be applied. The keyword of 'pos' or 'neg' without a range will apply the constraint to all features.",
"Aliases": [
"mc"
],
"Required": false,
"SortOrder": 150.0,
"IsNullable": false,
"Default": null
},
{
"Name": "ParallelTrainer",
"Type": {
@@ -12596,6 +12626,21 @@
]
}
},
{
"Name": "MonotoneConstraints",
"Type": {
"Kind": "Array",
"ItemType": "String"
},
"Desc": "Sets the constraints for monotonic features. This is a 0 based index for each feature in the features column. A keyword of 'pos' for positive constraint or 'neg' for negative constraint is specified followed by a range. For example, pos:0-2 neg:3,5 will apply a positive constraint to the first three features and a negative constraint to the 4th and 6th feature. If feature index is not specified, then no constraint will be applied. The keyword of 'pos' or 'neg' without a range will apply the constraint to all features.",
"Aliases": [
"mc"
],
"Required": false,
"SortOrder": 150.0,
"IsNullable": false,
"Default": null
},
{
"Name": "ParallelTrainer",
"Type": {
@@ -13091,6 +13136,21 @@
]
}
},
{
"Name": "MonotoneConstraints",
"Type": {
"Kind": "Array",
"ItemType": "String"
},
"Desc": "Sets the constraints for monotonic features. This is a 0 based index for each feature in the features column. A keyword of 'pos' for positive constraint or 'neg' for negative constraint is specified followed by a range. For example, pos:0-2 neg:3,5 will apply a positive constraint to the first three features and a negative constraint to the 4th and 6th feature. If feature index is not specified, then no constraint will be applied. The keyword of 'pos' or 'neg' without a range will apply the constraint to all features.",
"Aliases": [
"mc"
],
"Required": false,
"SortOrder": 150.0,
"IsNullable": false,
"Default": null
},
{
"Name": "ParallelTrainer",
"Type": {
@@ -0,0 +1,65 @@
maml.exe CV tr=LightGBMBinary{nt=1 nl=5 mil=5 lr=0.25 iter=20 mb=255 mc=pos:0 mc=neg:1} threads=- cache=- dout=%Output% loader=Text{sparse- col=Attr:TX:6 col=Label:0 col=Features:1-5,6,7-9} data=%Data% seed=1
Not adding a normalizer.
Auto-tuning parameters: UseCat = False
LightGBM objective=binary
Not training a calibrator because it is not needed.
Not adding a normalizer.
Auto-tuning parameters: UseCat = False
LightGBM objective=binary
Not training a calibrator because it is not needed.
TEST POSITIVE RATIO: 0.3702 (134.0/(134.0+228.0))
Confusion table
||======================
PREDICTED || positive | negative | Recall
TRUTH ||======================
positive || 131 | 3 | 0.9776
negative || 10 | 218 | 0.9561
||======================
Precision || 0.9291 | 0.9864 |
OVERALL 0/1 ACCURACY: 0.964088
LOG LOSS/instance: 0.203994
Test-set entropy (prior Log-Loss/instance): 0.950799
LOG-LOSS REDUCTION (RIG): 78.544978
AUC: 0.985189
TEST POSITIVE RATIO: 0.3175 (107.0/(107.0+230.0))
Confusion table
||======================
PREDICTED || positive | negative | Recall
TRUTH ||======================
positive || 99 | 8 | 0.9252
negative || 7 | 223 | 0.9696
||======================
Precision || 0.9340 | 0.9654 |
OVERALL 0/1 ACCURACY: 0.955490
LOG LOSS/instance: 0.140946
Test-set entropy (prior Log-Loss/instance): 0.901650
LOG-LOSS REDUCTION (RIG): 84.367991
AUC: 0.992361

OVERALL RESULTS
---------------------------------------
AUC: 0.988775 (0.0036)
Accuracy: 0.959789 (0.0043)
Positive precision: 0.931520 (0.0024)
Positive recall: 0.951423 (0.0262)
Negative precision: 0.975897 (0.0105)
Negative recall: 0.962853 (0.0067)
Log-loss: 0.172470 (0.0315)
Log-loss reduction: 81.456484 (2.9115)
F1 Score: 0.941152 (0.0116)
AUPRC: 0.975728 (0.0096)

---------------------------------------
Physical memory usage(MB): %Number%
Virtual memory usage(MB): %Number%
%DateTime% Time elapsed(s): %Number%

--- Progress log ---
[1] 'Loading data for LightGBM' started.
[1] 'Loading data for LightGBM' finished in %Time%.
[2] 'Training with LightGBM' started.
[2] 'Training with LightGBM' finished in %Time%.
[3] 'Loading data for LightGBM #2' started.
[3] 'Loading data for LightGBM #2' finished in %Time%.
[4] 'Training with LightGBM #2' started.
[4] 'Training with LightGBM #2' finished in %Time%.
@@ -0,0 +1,4 @@
LightGBMBinary
AUC Accuracy Positive precision Positive recall Negative precision Negative recall Log-loss Log-loss reduction F1 Score AUPRC /iter /lr /nl /mil /nt /mc Learner Name Train Dataset Test Dataset Results File Run Time Physical Memory Virtual Memory Command Line Settings
0.988775 0.959789 0.93152 0.951423 0.975897 0.962853 0.17247 81.45648 0.941152 0.975728 20 0.25 5 5 1 pos:0,neg:1 LightGBMBinary %Data% %Output% 99 0 0 maml.exe CV tr=LightGBMBinary{nt=1 nl=5 mil=5 lr=0.25 iter=20 mb=255 mc=pos:0 mc=neg:1} threads=- cache=- dout=%Output% loader=Text{sparse- col=Attr:TX:6 col=Label:0 col=Features:1-5,6,7-9} data=%Data% seed=1 /iter:20;/lr:0.25;/nl:5;/mil:5;/nt:1;/mc:pos:0,neg:1
