Exception on using LightGBM trainer with FeatureContributionCalculation and OneHotEncoding #3272
Comments
@vKuryshev Thanks for reporting this issue! Definitely not the expected results :) @wschin This looks like a bug. Have you been able to reproduce it?
I was able to reproduce this issue. Note that the API has changed since this issue was opened, so for the record I'll leave a full repro here using the latest API. I will look into this.
Repro:
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace Bugs
{
    public class TaxiTrip
    {
        [LoadColumn(0)]
        public string VendorId;
        [LoadColumn(1)]
        public float RateCode;
        [LoadColumn(2)]
        public float PassengerCount;
        [LoadColumn(3)]
        public float TripTime;
        [LoadColumn(4)]
        public float TripDistance;
        [LoadColumn(5)]
        public string PaymentType;
        [LoadColumn(6)]
        public float FareAmount;
    }

    static class DefaultColumnNames
    {
        public const string Features = "Features";
        public const string Label = "Label";
        public const string GroupId = "GroupId";
        public const string Name = "Name";
        public const string Weight = "Weight";
        public const string Score = "Score";
        public const string Probability = "Probability";
        public const string PredictedLabel = "PredictedLabel";
        public const string RecommendedItems = "Recommended";
        public const string User = "User";
        public const string Item = "Item";
        public const string Date = "Date";
        public const string FeatureContributions = "FeatureContributions";
    }

    class Program
    {
        static void Main(string[] args)
        {
            var TrainDataPath = @"C:\Users\anvelazq\Desktop\mymlnet\test\data\taxi-fare-train.csv";
            var TestDataPath = @"C:\Users\anvelazq\Desktop\mymlnet\test\data\taxi-fare-test.csv";

            var mlContext = new MLContext();
            IDataView trainingDataView = mlContext.Data.LoadFromTextFile<TaxiTrip>(TrainDataPath, hasHeader: true, separatorChar: ',');
            IDataView testDataView = mlContext.Data.LoadFromTextFile<TaxiTrip>(TestDataPath, hasHeader: true, separatorChar: ',');

            var VendorIdEncoded = "VendorIdEncoded";
            var RateCodeEncoded = "RateCodeEncoded";
            var PaymentTypeEncoded = "PaymentTypeEncoded";

            var dataProcessPipeline = mlContext.Transforms.CopyColumns(outputColumnName: DefaultColumnNames.Label, inputColumnName: nameof(TaxiTrip.FareAmount))
                .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: VendorIdEncoded, inputColumnName: nameof(TaxiTrip.VendorId)))
                .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: RateCodeEncoded, inputColumnName: nameof(TaxiTrip.RateCode)))
                .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: PaymentTypeEncoded, inputColumnName: nameof(TaxiTrip.PaymentType)))
                .Append(mlContext.Transforms.NormalizeMeanVariance(outputColumnName: nameof(TaxiTrip.PassengerCount)))
                .Append(mlContext.Transforms.NormalizeMeanVariance(outputColumnName: nameof(TaxiTrip.TripTime)))
                .Append(mlContext.Transforms.NormalizeMeanVariance(outputColumnName: nameof(TaxiTrip.TripDistance)))
                .Append(mlContext.Transforms.Concatenate(DefaultColumnNames.Features, VendorIdEncoded, RateCodeEncoded, PaymentTypeEncoded,
                    nameof(TaxiTrip.PassengerCount), nameof(TaxiTrip.TripTime), nameof(TaxiTrip.TripDistance)));

            var trainer = mlContext.Regression.Trainers.LightGbm(labelColumnName: DefaultColumnNames.Label, featureColumnName: DefaultColumnNames.Features);
            var trainingPipeline = dataProcessPipeline.Append(trainer);
            var trainedModel = trainingPipeline.Fit(trainingDataView);

            IDataView predictions = trainedModel.Transform(testDataView);
            var featureContributionCalculation = mlContext.Transforms.CalculateFeatureContribution(trainedModel.LastTransformer, normalize: false);
            var featureContributionData = featureContributionCalculation.Fit(predictions).Transform(predictions);
            var contributions = featureContributionData.GetColumn<float[]>(DefaultColumnNames.FeatureContributions).ToList();
        }
    }
}
I've noticed that the exception disappears if I set This happens because if the training set has over 50,000 rows, then the GetCategoricalMetaData method will behave just as if I had explicitly set
machinelearning/src/Microsoft.ML.LightGbm/LightGbmTrainerBase.cs, lines 532 to 541 in 1ea5336
I am still not sure why this affects the sample and causes the exception, but it's clear that it only happens if LightGBM uses "categorical split".
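For anyone who wants to check the "categorical split" observation against the repro, here is a minimal sketch of building the trainer with categorical splits turned off. The comment above does not show which setting was actually changed, so using the `UseCategoricalSplit` field of `LightGbmRegressionTrainer.Options` here is an assumption, and the helper class and method names are hypothetical.

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers.LightGbm;

// Hypothetical helper (not part of the original repro).
static class LightGbmWorkaround
{
    // Builds a LightGBM regression trainer with categorical splits disabled.
    // Assumption: UseCategoricalSplit is the setting the comment refers to.
    public static LightGbmRegressionTrainer CreateTrainer(MLContext mlContext)
    {
        var options = new LightGbmRegressionTrainer.Options
        {
            LabelColumnName = "Label",       // same value as DefaultColumnNames.Label
            FeatureColumnName = "Features",  // same value as DefaultColumnNames.Features
            UseCategoricalSplit = false      // force LightGBM not to use categorical splits
        };
        return mlContext.Regression.Trainers.LightGbm(options);
    }
}
```

In the repro, this would replace the `var trainer = ...` line with `var trainer = LightGbmWorkaround.CreateTrainer(mlContext);`. If the assumption holds, the feature contribution step should then run without the exception, at the cost of LightGBM treating the one-hot encoded slots as ordinary numeric features.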
TL;DR: This exception happens because The particular exception of this sample comes because any given
machinelearning/src/Microsoft.ML.FastTree/TreeEnsemble/InternalRegressionTree.cs, lines 1518 to 1519 in 1ea5336
It seems that array is created, for every tree in the ensemble, in the
machinelearning/src/Microsoft.ML.LightGbm/WrappedLightGbmBooster.cs, lines 235 to 245 in 1ea5336
Back in
System information
Issue
I used FeatureContributionCalculation with the LightGbm trainer. My data pipeline contains OneHotEncoding features.
When I try to calculate the feature contributions, I get the following exception
Source code / logs
My code example
I used the standard data from the "taxi-fare-train.csv" file in the ML.NET examples.