Monotone constraint support for LightGBM #2330
Changes from 9 commits
0e07012
3c451f4
28ad5cb
81d29a9
764ee16
7e2df45
6c819d4
136bc38
1e481e8
b22a303
e391cbc
0231e6c
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -371,6 +371,15 @@ public enum EvalMetricType | |
[TlcModule.SweepableDiscreteParam("CatL2", new object[] { 0.1, 0.5, 1, 5, 10 })] | ||
public double CatL2 = 10; | ||
|
||
[Argument(ArgumentType.Multiple, | ||
HelpText = "Sets the constraints for monotonic features. This is a 0 based index for each feature in " + | ||
"the features column. A keyword of 'pos' for positive constraint or 'neg' for negative constraint is " + | ||
"specified followed by a range. For example, pos:0-2 neg:3,5 will apply a positive constraint to the " + | ||
"first three features and a negative constraint to the 4th and 6th feature. If feature index is not specified, " + | ||
"then no constraint will be applied. The keyword of 'pos' or 'neg' without a range will apply the constraint to all features.", | ||
ShortName="mc")] | ||
public string[] MonotoneConstraints; | ||
A) Please unblock and run the test RegenerateEntryPointCatalog on your local machine and push the changes in core_ep-list.tsv.

Sorry to bring this up on the 9th iteration, but I'm just curious: wouldn't it be easier to have two options, one for positive monotone constraints and another for negative, both of type int[]?

Or an array of Range, or whatever class we use in TextLoader to specify a range of columns.

Using the TextLoader's text-based ranges offers interesting abilities, such as "~". The current method is interesting as it allows for setting everything positive, then some negative: "pos neg:5-10". The gain here is that the user doesn't need to know the total number of slots. I don't see a shortcut to set positive on all except for a few: "pos neutral:5-10". If you know the total number of slots, you could do "pos:0-4,11-547542".

So initially I did not use the TextLoader Range as it looked to contain properties specific to text loading. However, your comments resonated with me, so I have made some changes:

Overall I like this better as it follows the range format that is already used for the text loader. If you want to select all features, you can use the * and set the range to 0-*. The short names for these arguments are mp and mn. Take a look; I'm curious what you think. There may be an opportunity to merge the Microsoft.ML.Core Range and the TextLoader Range, but I think that should be a different discussion.

I didn't expect you to take me seriously on this proposal; I was just thinking out loud.

Thanks @Ivanidzo4ka. It looks like this PR fails the .netcore30 builds because there is now a System.Range, and that is causing some compiler confusion as to which one is being referenced. So I am curious as to what @TomFinley has to say.
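For reference, the "two options of type int[]" idea from this thread could look roughly like the sketch below. It is only an illustration: `MonotoneOptionsSketch`, `Build`, and the parameter names are hypothetical and are not part of this PR or of ML.NET. It assumes that the negative list is applied last, so it wins when an index appears in both lists.

```csharp
using System;

public static class MonotoneOptionsSketch
{
    // Hypothetical helper (not in the PR): merges two index lists into the
    // comma-separated 1 / 0 / -1 string used for LightGBM's monotone_constraints.
    public static string Build(int featureCount, int[] positive, int[] negative)
    {
        var constraints = new int[featureCount]; // every slot starts at 0 (no constraint)
        Apply(constraints, positive, 1);
        Apply(constraints, negative, -1); // applied last, so negative wins on overlap
        return string.Join(",", constraints);
    }

    private static void Apply(int[] constraints, int[] indices, int value)
    {
        if (indices == null)
            return;

        foreach (int index in indices)
        {
            if (index < 0 || index >= constraints.Length)
                throw new ArgumentOutOfRangeException(nameof(indices), $"Feature index {index} is out of range.");
            constraints[index] = value;
        }
    }
}
```

With six features, `MonotoneOptionsSketch.Build(6, new[] { 0, 1, 2 }, new[] { 3, 5 })` returns `"1,1,1,-1,0,-1"`, which matches the `pos:0-2 neg:3,5` example in the help text. The trade-off raised above still applies: plain index arrays have no way to say "all features" without knowing the slot count.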
||
|
||
[Argument(ArgumentType.Multiple, HelpText = "Parallel LightGBM Learning Algorithm", ShortName = "parag")] | ||
public ISupportParallel ParallelTrainer = new SingleTrainerFactory(); | ||
|
||
|
@@ -428,6 +437,7 @@ public enum EvalMetricType | |
res[GetArgName(nameof(MaxCatThreshold))] = MaxCatThreshold; | ||
res[GetArgName(nameof(CatSmooth))] = CatSmooth; | ||
res[GetArgName(nameof(CatL2))] = CatL2; | ||
res[GetArgName(nameof(MonotoneConstraints))] = (object)MonotoneConstraints; | ||
return res; | ||
} | ||
} | ||
|
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
|
@@ -4,6 +4,7 @@ | |||||||
|
||||||||
using System; | ||||||||
using System.Collections.Generic; | ||||||||
using System.Linq; | ||||||||
using Microsoft.ML.Core.Data; | ||||||||
using Microsoft.ML.Data; | ||||||||
using Microsoft.ML.EntryPoints; | ||||||||
|
@@ -89,6 +90,116 @@ private sealed class CategoricalMetaData | |||||||
InitParallelTraining(); | ||||||||
} | ||||||||
|
||||||||
[BestFriend] | ||||||||
internal void ExpandMonotoneConstraint(ref Dictionary<string, object> options, int featureCount) | ||||||||
{ | ||||||||
string monotoneArgumentName = "monotone_constraints"; | ||||||||
if (!options.ContainsKey(monotoneArgumentName)) | ||||||||
return; | ||||||||
|
||||||||
// If the constraints argument is not a string array, then return | ||||||||
if (!(options[monotoneArgumentName] is string[])) | ||||||||
return; | ||||||||
|
||||||||
string[] constraintArguments = (string[])options[monotoneArgumentName]; | ||||||||
if (constraintArguments == null || constraintArguments.Length == 0) | ||||||||
{ | ||||||||
options.Remove(monotoneArgumentName); | ||||||||
return; | ||||||||
} | ||||||||
|
||||||||
// Convert the monotone constraints parameter to be consumed by LightGBM. | ||||||||
// The format of the constraint is a keyword that specifies | ||||||||
// the constraint type: pos for a positive constraint (which maps to | ||||||||
// 1 for LightGBM) and neg for a negative constraint (which maps to -1 | ||||||||
// for LightGBM), optionally followed by ':' and a range. | ||||||||
// For example: | ||||||||
// pos:0-3,5 | ||||||||
// neg:6-7 | ||||||||
// Since the argument is a string array, multiple ranges can be specified. | ||||||||
// In the event of the same index being specified twice, the last one wins. | ||||||||
// The end result is a comma-separated string of 1s, 0s, and -1s whose length should be equal | ||||||||
// to the number of features in the Feature column. | ||||||||
int[] constraintArray = new int[featureCount]; | ||||||||
const string positiveKeyword = "pos"; | ||||||||
const string negativeKeyword = "neg"; | ||||||||
|
||||||||
foreach (var argument in constraintArguments) | ||||||||
{ | ||||||||
// Split by : to get the keyword and range | ||||||||
var subArguments = argument.Split(':'); | ||||||||
if (subArguments.Length > 2) | ||||||||
// Invalid argument: more than one ':' separator, so fail | ||||||||
throw Contracts.Except(Host, $"Invalid argument {argument}"); | ||||||||
|
||||||||
var keyword = subArguments[0].ToLowerInvariant(); | ||||||||
int constraint = 0; | ||||||||
if (keyword.Equals(positiveKeyword, StringComparison.OrdinalIgnoreCase)) | ||||||||
constraint = 1; | ||||||||
else if (keyword.Equals(negativeKeyword, StringComparison.OrdinalIgnoreCase)) | ||||||||
constraint = -1; | ||||||||
else | ||||||||
throw Contracts.Except(Host, $"Unsupported keyword {keyword}"); | ||||||||
|
||||||||
// If only the keyword (pos or neg) is present without a range, this will | ||||||||
// set the same constraint for all features. | ||||||||
if (subArguments.Length == 1) | ||||||||
{ | ||||||||
for (int i = 0; i < featureCount; i++) | ||||||||
constraintArray[i] = constraint; | ||||||||
continue; | ||||||||
} | ||||||||
|
||||||||
var rangesArgument = subArguments[1]; | ||||||||
|
||||||||
// Parse the ranges. Since multiple ranges | ||||||||
// can be specified in a single argument, split by comma first | ||||||||
var indexRanges = new List<(int min, int max)>(); | ||||||||
int min; | ||||||||
int max; | ||||||||
var ranges = rangesArgument.Split(','); | ||||||||
if (ranges.Length == 0) | ||||||||
continue; | ||||||||
|
||||||||
foreach (var range in ranges) | ||||||||
{ | ||||||||
min = 0; | ||||||||
max = 0; | ||||||||
|
||||||||
// Split by - | ||||||||
var minMax = range.Split('-'); | ||||||||
if (minMax.Length > 2) | ||||||||
throw Contracts.Except(Host, $"Invalid range specified {range}"); | ||||||||
|
||||||||
if (minMax.Length == 1) | ||||||||
{ | ||||||||
// single variable | ||||||||
if (int.TryParse(minMax[0], out min) && | ||||||||
min >= 0 && min < featureCount) | ||||||||
indexRanges.Add((min, min)); | ||||||||
} | ||||||||
else | ||||||||
{ | ||||||||
if (int.TryParse(minMax[0], out min) && | ||||||||
int.TryParse(minMax[1], out max) && | ||||||||
min < max && | ||||||||
min >= 0 && min < featureCount && | ||||||||
max > 0 && max < featureCount) | ||||||||
|
||||||||
indexRanges.Add((min, max)); | ||||||||
} | ||||||||
} | ||||||||
|
||||||||
// Process each range | ||||||||
foreach (var indexRange in indexRanges) | ||||||||
for (int i = indexRange.min; i <= indexRange.max; ++i) | ||||||||
constraintArray[i] = constraint; | ||||||||
} | ||||||||
// Update Options to contain the expanded array | ||||||||
var optionString = string.Join(",", constraintArray); | ||||||||
options[monotoneArgumentName] = optionString; | ||||||||
} | ||||||||
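To make the expansion above concrete, here is a condensed, standalone sketch of the same idea. It is an approximation rather than the PR's code: `MonotoneConstraintSketch` and `Expand` are hypothetical names, and unlike the method above it throws on malformed or out-of-range ranges instead of silently skipping them (it also accepts a degenerate range such as `3-3`).

```csharp
using System;

public static class MonotoneConstraintSketch
{
    // Hypothetical standalone version of the expansion: turns arguments such as
    // "pos:0-3,5" into a comma-separated string with one entry per feature.
    public static string Expand(string[] arguments, int featureCount)
    {
        var constraints = new int[featureCount]; // 0 means "no constraint"
        foreach (string argument in arguments)
        {
            string[] parts = argument.Split(':');
            if (parts.Length > 2)
                throw new FormatException($"Invalid argument {argument}");

            int value;
            switch (parts[0].ToLowerInvariant())
            {
                case "pos": value = 1; break;
                case "neg": value = -1; break;
                default: throw new FormatException($"Unsupported keyword {parts[0]}");
            }

            if (parts.Length == 1)
            {
                // Keyword without a range: constrain every feature.
                for (int i = 0; i < featureCount; i++)
                    constraints[i] = value;
                continue;
            }

            foreach (string range in parts[1].Split(','))
            {
                string[] bounds = range.Split('-');
                if (bounds.Length > 2)
                    throw new FormatException($"Invalid range specified {range}");

                int min = int.Parse(bounds[0]);
                int max = bounds.Length == 2 ? int.Parse(bounds[1]) : min;
                if (min < 0 || max < min || max >= featureCount)
                    throw new ArgumentOutOfRangeException(nameof(arguments), $"Invalid range specified {range}");

                // Later arguments overwrite earlier ones, so the last one wins.
                for (int i = min; i <= max; i++)
                    constraints[i] = value;
            }
        }

        return string.Join(",", constraints);
    }
}
```

Assuming nine feature slots, `Expand(new[] { "pos:0", "neg:1" }, 9)` yields `"1,-1,0,0,0,0,0,0,0"`, and `Expand(new[] { "pos", "neg:5-6" }, 8)` yields `"1,1,1,1,1,-1,-1,1"`, which illustrates the "set everything positive, then some negative" pattern.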
|
||||||||
private protected LightGbmTrainerBase(IHostEnvironment env, string name, Options options, SchemaShape.Column label) | ||||||||
: base(Contracts.CheckRef(env, nameof(env)).Register(name), TrainerUtils.MakeR4VecFeature(options.FeatureColumn), label, TrainerUtils.MakeR4ScalarWeightColumn(options.WeightColumn)) | ||||||||
{ | ||||||||
|
@@ -169,7 +280,7 @@ private protected virtual void CheckDataValid(IChannel ch, RoleMappedData data) | |||||||
ch.CheckParam(data.Schema.Label.HasValue, nameof(data), "Need a label column"); | ||||||||
} | ||||||||
|
||||||||
protected virtual void GetDefaultParameters(IChannel ch, int numRow, bool hasCategarical, int totalCats, bool hiddenMsg=false) | ||||||||
protected virtual void GetDefaultParameters(IChannel ch, int numRow, bool hasCategarical, int totalCats, bool hiddenMsg = false) | ||||||||
{ | ||||||||
double learningRate = Args.LearningRate ?? DefaultLearningRate(numRow, hasCategarical, totalCats); | ||||||||
int numLeaves = Args.NumLeaves ?? DefaultNumLeaves(numRow, hasCategarical, totalCats); | ||||||||
|
@@ -318,6 +429,7 @@ private Dataset LoadTrainingData(IChannel ch, RoleMappedData trainData, out Cate | |||||||
GetMetainfo(ch, factory, out int numRow, out float[] labels, out float[] weights, out int[] groups); | ||||||||
catMetaData = GetCategoricalMetaData(ch, trainData, numRow); | ||||||||
GetDefaultParameters(ch, numRow, catMetaData.CategoricalBoudaries != null, catMetaData.TotalCats); | ||||||||
ExpandMonotoneConstraint(ref Options, catMetaData.NumCol); | ||||||||
|
||||||||
Dataset dtrain; | ||||||||
string param = LightGbmInterfaceUtils.JoinParameters(Options); | ||||||||
|
@@ -590,7 +702,7 @@ private void GetFeatureValueDense(IChannel ch, FloatLabelCursor cursor, Categori | |||||||
int[] nonZeroCntPerColumn = new int[catMetaData.NumCol]; | ||||||||
int estimateNonZeroCnt = (int)(numSampleRow * density); | ||||||||
estimateNonZeroCnt = Math.Max(1, estimateNonZeroCnt); | ||||||||
for(int i = 0; i < catMetaData.NumCol; i++) | ||||||||
for (int i = 0; i < catMetaData.NumCol; i++) | ||||||||
{ | ||||||||
nonZeroCntPerColumn[i] = 0; | ||||||||
sampleValuePerColumn[i] = new double[estimateNonZeroCnt]; | ||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
maml.exe CV tr=LightGBMBinary{nt=1 nl=5 mil=5 lr=0.25 iter=20 mb=255 mc=pos:0 mc=neg:1} threads=- cache=- dout=%Output% loader=Text{sparse- col=Attr:TX:6 col=Label:0 col=Features:1-5,6,7-9} data=%Data% seed=1 | ||
Not adding a normalizer. | ||
Auto-tuning parameters: UseCat = False | ||
LightGBM objective=binary | ||
Not training a calibrator because it is not needed. | ||
Not adding a normalizer. | ||
Auto-tuning parameters: UseCat = False | ||
LightGBM objective=binary | ||
Not training a calibrator because it is not needed. | ||
TEST POSITIVE RATIO: 0.3702 (134.0/(134.0+228.0)) | ||
Confusion table | ||
||====================== | ||
PREDICTED || positive | negative | Recall | ||
TRUTH ||====================== | ||
positive || 131 | 3 | 0.9776 | ||
negative || 10 | 218 | 0.9561 | ||
||====================== | ||
Precision || 0.9291 | 0.9864 | | ||
OVERALL 0/1 ACCURACY: 0.964088 | ||
LOG LOSS/instance: 0.203994 | ||
Test-set entropy (prior Log-Loss/instance): 0.950799 | ||
LOG-LOSS REDUCTION (RIG): 78.544978 | ||
AUC: 0.985189 | ||
TEST POSITIVE RATIO: 0.3175 (107.0/(107.0+230.0)) | ||
Confusion table | ||
||====================== | ||
PREDICTED || positive | negative | Recall | ||
TRUTH ||====================== | ||
positive || 99 | 8 | 0.9252 | ||
negative || 7 | 223 | 0.9696 | ||
||====================== | ||
Precision || 0.9340 | 0.9654 | | ||
OVERALL 0/1 ACCURACY: 0.955490 | ||
LOG LOSS/instance: 0.140946 | ||
Test-set entropy (prior Log-Loss/instance): 0.901650 | ||
LOG-LOSS REDUCTION (RIG): 84.367991 | ||
AUC: 0.992361 | ||
|
||
OVERALL RESULTS | ||
--------------------------------------- | ||
AUC: 0.988775 (0.0036) | ||
Accuracy: 0.959789 (0.0043) | ||
Positive precision: 0.931520 (0.0024) | ||
Positive recall: 0.951423 (0.0262) | ||
Negative precision: 0.975897 (0.0105) | ||
Negative recall: 0.962853 (0.0067) | ||
Log-loss: 0.172470 (0.0315) | ||
Log-loss reduction: 81.456484 (2.9115) | ||
F1 Score: 0.941152 (0.0116) | ||
AUPRC: 0.975728 (0.0096) | ||
|
||
--------------------------------------- | ||
Physical memory usage(MB): %Number% | ||
Virtual memory usage(MB): %Number% | ||
%DateTime% Time elapsed(s): %Number% | ||
|
||
--- Progress log --- | ||
[1] 'Loading data for LightGBM' started. | ||
[1] 'Loading data for LightGBM' finished in %Time%. | ||
[2] 'Training with LightGBM' started. | ||
[2] 'Training with LightGBM' finished in %Time%. | ||
[3] 'Loading data for LightGBM #2' started. | ||
[3] 'Loading data for LightGBM #2' finished in %Time%. | ||
[4] 'Training with LightGBM #2' started. | ||
[4] 'Training with LightGBM #2' finished in %Time%. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
LightGBMBinary | ||
AUC Accuracy Positive precision Positive recall Negative precision Negative recall Log-loss Log-loss reduction F1 Score AUPRC /iter /lr /nl /mil /nt /mc Learner Name Train Dataset Test Dataset Results File Run Time Physical Memory Virtual Memory Command Line Settings | ||
0.988775 0.959789 0.93152 0.951423 0.975897 0.962853 0.17247 81.45648 0.941152 0.975728 20 0.25 5 5 1 pos:0,neg:1 LightGBMBinary %Data% %Output% 99 0 0 maml.exe CV tr=LightGBMBinary{nt=1 nl=5 mil=5 lr=0.25 iter=20 mb=255 mc=pos:0 mc=neg:1} threads=- cache=- dout=%Output% loader=Text{sparse- col=Attr:TX:6 col=Label:0 col=Features:1-5,6,7-9} data=%Data% seed=1 /iter:20;/lr:0.25;/nl:5;/mil:5;/nt:1;/mc:pos:0,neg:1 | ||
|
Please add a ShortName; if someone ever needs to use this on the command line, they would curse you for typing these words again and again. #Closed