Exception on using LightGBM trainer with FeatureContributionCalculation and OneHotEncoding #3272

Closed
vKuryshev opened this issue Apr 10, 2019 · 4 comments · Fixed by #5018
Labels
bug Something isn't working P0 Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away.

@vKuryshev

System information

  • OS version/distro: Windows 10
  • .NET Version (eg., dotnet --info): 4.7.1

Issue

  • What did you do?
    I used FeatureContributionCalculation with LightGbm trainer. My data pipeline contains OneHotEncoding features.
  • What happened?
    When I try to compute the feature contributions, I get the following exception:
System.InvalidOperationException
  HResult=0x80131509
  Message=Splitter/consolidator worker encountered exception while consuming source data
  Source=Microsoft.ML.Data
  StackTrace:
   at Microsoft.ML.Data.DataViewUtils.Splitter.Batch.SetAll(OutPipe[] pipes)
   at Microsoft.ML.Data.DataViewUtils.Splitter.Cursor.MoveNextCore()
   at Microsoft.ML.Data.RootCursorBase.MoveNext()
   at Microsoft.ML.Data.ColumnCursorExtensions.<GetColumnArrayDirect>d__3`1.MoveNext()
   at System.Collections.Generic.List`1..ctor(IEnumerable`1 collection)
   at System.Linq.Enumerable.ToList[TSource](IEnumerable`1 source)
   at ConsoleApp1.Program.BuildTrainEvaluateAndSaveModel(MLContext mlContext) in C:\Users\vladimir.kuryshev\source\repos\ConsoleApp1\ConsoleApp1\Program.cs:line 150
   at ConsoleApp1.Program.Main(String[] args) in C:\Users\vladimir.kuryshev\source\repos\ConsoleApp1\ConsoleApp1\Program.cs:line 45

Inner Exception 1:
ArgumentOutOfRangeException: Specified argument was out of the range of valid values.
Parameter name: slot

   at Microsoft.ML.Data.VBuffer`1.GetItemOrDefault(Int32 slot)
   at Microsoft.ML.Trainers.FastTree.InternalRegressionTree.AppendFeatureContributions(VBuffer`1& src, BufferBuilder`1 contributions)
   at Microsoft.ML.Trainers.FastTree.InternalTreeEnsemble.GetFeatureContributions(VBuffer`1& features, VBuffer`1& contribs, BufferBuilder`1& builder)
   at Microsoft.ML.Trainers.FastTree.TreeEnsembleModelParameters.<>c__DisplayClass30_0`2.<Microsoft.ML.Model.IFeatureContributionMapper.GetFeatureContributionMapper>b__0(VBuffer`1& src, VBuffer`1& dst)
   at Microsoft.ML.Data.DataViewUtils.Splitter.InPipe.Impl`1.Fill()
   at Microsoft.ML.Data.DataViewUtils.Splitter.<>c__DisplayClass5_1.<ConsolidateCore>b__2()

Source code / logs

My code example

   IDataView trainingDataView = mlContext.Data.LoadFromTextFile<TaxiTrip>(TrainDataPath, hasHeader: true, separatorChar: ',');
   var dataProcessPipeline = mlContext.Transforms.CopyColumns(outputColumnName: DefaultColumnNames.Label, inputColumnName: nameof(TaxiTrip.FareAmount))
                            .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: VendorIdEncoded, inputColumnName: nameof(TaxiTrip.VendorId)))
                            .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: RateCodeEncoded, inputColumnName: nameof(TaxiTrip.RateCode)))
                            .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: PaymentTypeEncoded, inputColumnName: nameof(TaxiTrip.PaymentType)))
                            .Append(mlContext.Transforms.Normalize(outputColumnName: nameof(TaxiTrip.PassengerCount), mode: NormalizingEstimator.NormalizerMode.MeanVariance))
                            .Append(mlContext.Transforms.Normalize(outputColumnName: nameof(TaxiTrip.TripTime), mode: NormalizingEstimator.NormalizerMode.MeanVariance))
                            .Append(mlContext.Transforms.Normalize(outputColumnName: nameof(TaxiTrip.TripDistance), mode: NormalizingEstimator.NormalizerMode.MeanVariance))
                            .Append(mlContext.Transforms.Concatenate(DefaultColumnNames.Features, VendorIdEncoded, RateCodeEncoded, PaymentTypeEncoded, nameof(TaxiTrip.PassengerCount)
                            , nameof(TaxiTrip.TripTime), nameof(TaxiTrip.TripDistance)));

   var trainer = mlContext.Regression.Trainers.LightGbm(labelColumnName: DefaultColumnNames.Label, featureColumnName: DefaultColumnNames.Features);
   var trainingPipeline = dataProcessPipeline.Append(trainer);

   var trainedModel = trainingPipeline.Fit(trainingDataView);
   IDataView predictions = trainedModel.Transform(testDataView);

   var featureContributionCalculation = mlContext.Model.Explainability.FeatureContributionCalculation(trainedModel.LastTransformer.Model);
   var featureContributionData = featureContributionCalculation.Fit(predictions).Transform(predictions);
   var contributions = featureContributionData.GetColumn<float[]>(mlContext, DefaultColumnNames.FeatureContributions).ToList();

I used the standard data from the "taxi-fare-train.csv" file in the ML.NET examples.

  • Note: the issue can't be reproduced if I remove OneHotEncoding from the pipeline or change the trainer to, for example, FastTree.
@rogancarr rogancarr added the bug Something isn't working label Apr 11, 2019
@rogancarr
Contributor

@vKuryshev Thanks for reporting this issue! Definitely not the expected results :)

@wschin This looks like a bug. Have you been able to reproduce it?

@wschin wschin added the P1 Priority of the issue for triage purpose: Needs to be fixed soon. label May 21, 2019
@harishsk harishsk added P0 Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away. and removed P1 Priority of the issue for triage purpose: Needs to be fixed soon. labels Jan 10, 2020
@antoniovs1029
Member

antoniovs1029 commented Apr 6, 2020

I was able to reproduce this issue. Note that the API has changed since this issue was opened, so for the record, here is a full repro using the latest API. I will look into this.

Repro:
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace Bugs
{
    public class TaxiTrip
    {
        [LoadColumn(0)]
        public string VendorId;

        [LoadColumn(1)]
        public float RateCode;

        [LoadColumn(2)]
        public float PassengerCount;

        [LoadColumn(3)]
        public float TripTime;

        [LoadColumn(4)]
        public float TripDistance;

        [LoadColumn(5)]
        public string PaymentType;

        [LoadColumn(6)]
        public float FareAmount;
    }

    static class DefaultColumnNames
    {
        public const string Features = "Features";
        public const string Label = "Label";
        public const string GroupId = "GroupId";
        public const string Name = "Name";
        public const string Weight = "Weight";
        public const string Score = "Score";
        public const string Probability = "Probability";
        public const string PredictedLabel = "PredictedLabel";
        public const string RecommendedItems = "Recommended";
        public const string User = "User";
        public const string Item = "Item";
        public const string Date = "Date";
        public const string FeatureContributions = "FeatureContributions";
    }

    class Program
    {
        static void Main(string[] args)
        {
            var TrainDataPath = @"C:\Users\anvelazq\Desktop\mymlnet\test\data\taxi-fare-train.csv";
            var TestDataPath = @"C:\Users\anvelazq\Desktop\mymlnet\test\data\taxi-fare-test.csv";

            var mlContext = new MLContext();
            IDataView trainingDataView = mlContext.Data.LoadFromTextFile<TaxiTrip>(TrainDataPath, hasHeader: true, separatorChar: ',');
            IDataView testDataView = mlContext.Data.LoadFromTextFile<TaxiTrip>(TestDataPath, hasHeader: true, separatorChar: ',');

            var VendorIdEncoded = "VendorIdEncoded";
            var RateCodeEncoded = "RateCodeEncoded";
            var PaymentTypeEncoded = "PaymentTypeEncoded";

            var dataProcessPipeline = mlContext.Transforms.CopyColumns(outputColumnName: DefaultColumnNames.Label, inputColumnName: nameof(TaxiTrip.FareAmount))
                                     .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: VendorIdEncoded, inputColumnName: nameof(TaxiTrip.VendorId)))
                                     .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: RateCodeEncoded, inputColumnName: nameof(TaxiTrip.RateCode)))
                                     .Append(mlContext.Transforms.Categorical.OneHotEncoding(outputColumnName: PaymentTypeEncoded, inputColumnName: nameof(TaxiTrip.PaymentType)))
                                     .Append(mlContext.Transforms.NormalizeMeanVariance(outputColumnName: nameof(TaxiTrip.PassengerCount)))
                                     .Append(mlContext.Transforms.NormalizeMeanVariance(outputColumnName: nameof(TaxiTrip.TripTime)))
                                     .Append(mlContext.Transforms.NormalizeMeanVariance(outputColumnName: nameof(TaxiTrip.TripDistance)))
                                     .Append(mlContext.Transforms.Concatenate(DefaultColumnNames.Features, VendorIdEncoded, RateCodeEncoded, PaymentTypeEncoded, nameof(TaxiTrip.PassengerCount)
                                     , nameof(TaxiTrip.TripTime), nameof(TaxiTrip.TripDistance))); 

            var trainer = mlContext.Regression.Trainers.LightGbm(labelColumnName: DefaultColumnNames.Label, featureColumnName: DefaultColumnNames.Features);
            var trainingPipeline = dataProcessPipeline.Append(trainer);

            var trainedModel = trainingPipeline.Fit(trainingDataView);
            IDataView predictions = trainedModel.Transform(testDataView);

            var featureContributionCalculation = mlContext.Transforms.CalculateFeatureContribution(trainedModel.LastTransformer, normalize: false);
            var featureContributionData = featureContributionCalculation.Fit(predictions).Transform(predictions);
            var contributions = featureContributionData.GetColumn<float[]>(DefaultColumnNames.FeatureContributions).ToList();
        }
    }
}

@antoniovs1029
Member

antoniovs1029 commented Apr 9, 2020

I've noticed that the exception disappears if I set UseCategoricalSplit = false for the LightGBM trainer, or if I take only the first 50,000 rows or fewer from the training data.

This is because, when UseCategoricalSplit is left unset and the training set has more than 50,000 rows, the GetCategoricalMetaData method behaves just as if I had explicitly set UseCategoricalSplit = true.

const int useCatThreshold = 50000;
// Disable cat when data is too small, reduce the overfitting.
bool useCat = LightGbmTrainerOptions.UseCategoricalSplit ?? numRow > useCatThreshold;
if (!LightGbmTrainerOptions.UseCategoricalSplit.HasValue)
    ch.Info("Auto-tuning parameters: " + nameof(LightGbmTrainerOptions.UseCategoricalSplit) + " = " + useCat);
if (useCat)
{
    var featureCol = trainData.Schema.Schema[DefaultColumnNames.Features];
    AnnotationUtils.TryGetCategoricalFeatureIndices(trainData.Schema.Schema, featureCol.Index, out categoricalFeatures);
}

I am still not sure why this affects the sample and causes the exception, but it's clear that it only happens if LightGBM uses "categorical split".
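Following the observation above, a possible workaround until the bug is fixed is to opt out of categorical splits explicitly rather than rely on the row-count auto-tuning. This is a minimal sketch, assuming the Microsoft.ML.LightGbm package is referenced; the class and method names (WorkaroundSketch, CreateTrainer) are hypothetical, and the column names are the ones used in the repro.

```csharp
using Microsoft.ML;
using Microsoft.ML.Trainers.LightGbm;

static class WorkaroundSketch
{
    // Hypothetical helper: builds the LightGBM regression trainer with
    // categorical splits disabled, so the 50,000-row auto-tuning never enables them.
    public static LightGbmRegressionTrainer CreateTrainer(MLContext mlContext)
    {
        var options = new LightGbmRegressionTrainer.Options
        {
            LabelColumnName = "Label",
            FeatureColumnName = "Features",
            // Leaving this null lets the trainer decide based on the row count.
            UseCategoricalSplit = false
        };
        return mlContext.Regression.Trainers.LightGbm(options);
    }
}
```

This only sidesteps the exception; it also changes how the model treats the one-hot encoded columns, so results may differ from a model trained with categorical splits.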

@antoniovs1029
Member

antoniovs1029 commented Apr 10, 2020

TL;DR: This exception happens because InternalRegressionTree.AppendFeatureContributions() does not support calculating feature contributions for categorical splits, so that functionality must be added to fix this exception.

The exception in this sample occurs because an InternalRegressionTree of the trained model can have a SplitFeatures array containing -1 (e.g., in one particular run it was SplitFeatures = [-1, 8, 7]). That value is then used in the AppendFeatureContributions() method to read a value from a VBuffer, which throws when asked for the negative index.

int ifeat = SplitFeatures[node];
var val = src.GetItemOrDefault(ifeat);
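To see the failure mode in isolation, here is a minimal sketch that does not use ML.NET at all. GetItemOrDefault below is a hypothetical stand-in for a bounds-checked slot lookup, and FirstInvalidNode is an illustrative helper; neither is the library's code.

```csharp
using System;

static class NegativeSlotSketch
{
    // Hypothetical stand-in for a bounds-checked slot lookup (not ML.NET's VBuffer).
    public static float GetItemOrDefault(float[] values, int slot)
    {
        if (slot < 0 || slot >= values.Length)
            throw new ArgumentOutOfRangeException(nameof(slot));
        return values[slot];
    }

    // Returns the first node whose split feature is not a valid slot, or -1 if none.
    public static int FirstInvalidNode(float[] features, int[] splitFeatures)
    {
        for (int node = 0; node < splitFeatures.Length; node++)
        {
            try
            {
                GetItemOrDefault(features, splitFeatures[node]);
            }
            catch (ArgumentOutOfRangeException)
            {
                return node;
            }
        }
        return -1;
    }
}
```

With a 10-slot feature vector and SplitFeatures = [-1, 8, 7], the lookup fails at node 0, which mirrors the "Parameter name: slot" inner exception in the report.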

It seems that the array is created, for every tree in the ensemble, in the LightGBM.Booster.GetModel() method. That method explicitly assigns splitFeature[node] = -1 if the decision type of the node is a categorical split:

if (GetIsCategoricalSplit(decisionType[node]))
{
    int catIdx = (int)threshold[node];
    var cats = GetCatThresholds(catThreshold, catBoundaries[catIdx], catBoundaries[catIdx + 1]);
    categoricalSplitFeatures[node] = new int[cats.Length];
    // Convert Cat thresholds to feature indices.
    for (int j = 0; j < cats.Length; ++j)
        categoricalSplitFeatures[node][j] = splitFeature[node] + cats[j] - 1;
    splitFeature[node] = -1;
    categoricalSplit[node] = true;

Back in InternalRegressionTree.AppendFeatureContributions(), all splits are assumed to be numerical, not categorical. That is why it reads a single ifeat from SplitFeatures[] when, for a categorical split node, it should be reading multiple ifeats from CategoricalSplitFeatures[node].
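As a rough illustration of the distinction, the sketch below picks a branch at a single node while handling both split kinds. It is not the fix that was merged and not ML.NET code: the helper name and the array parameters are hypothetical, and it assumes that a categorical node takes the "greater than" branch when any of its one-hot slots is active.

```csharp
using System;

static class CategoricalSplitSketch
{
    // Hypothetical helper: decides which branch a row takes at one node.
    // Assumption: a categorical node goes to the "greater than" branch when any
    // of its one-hot slots is active (> 0); a numerical node compares one slot
    // against a threshold.
    public static bool TakesGreaterThanBranch(
        float[] features, int node,
        int[] splitFeatures, float[] thresholds,
        bool[] categoricalSplit, int[][] categoricalSplitFeatures)
    {
        if (categoricalSplit[node])
        {
            // splitFeatures[node] is -1 here, so consult the per-node slot list instead.
            foreach (int slot in categoricalSplitFeatures[node])
            {
                if (slot >= 0 && slot < features.Length && features[slot] > 0f)
                    return true;
            }
            return false;
        }

        // Numerical split: splitFeatures[node] is expected to be a valid slot.
        return features[splitFeatures[node]] > thresholds[node];
    }
}
```

The point of the sketch is only that the -1 entry must never be used as an index; a contribution calculation would need the same branch on the categorical flag before reading any feature value.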

@ghost ghost locked as resolved and limited conversation to collaborators Mar 22, 2022