SVMLightLoader Fails above 128 dense rows #5566

@justinormont

Description

System information

  • OS version/distro: macOS 10.14.6
  • .NET Version (eg., dotnet --info): 5.0.101

Issue

SVMLightLoader fails when you load more than 128 dense rows.

When the feature column sparsity is >0.25, internally the column is represented in sparse format, else dense. SVMLightLoader works if either the column is sparse (many missing values), or if the number of rows is < 128.

Error

Fails with one of three errors, depending on the dataset:

  • System.InvalidOperationException: Duplicate keys found in dataset

  • System.ArgumentException: Destination is too short. (Parameter 'destination')

  • System.IndexOutOfRangeException: Index was outside the bounds of the array.

Stack trace:

Unhandled exception. System.InvalidOperationException: Splitter/consolidator worker encountered exception while consuming source data
 ---> System.InvalidOperationException: Duplicate keys found in dataset
   at Microsoft.ML.Data.SvmLightLoader.OutputMapper.MapCore(VBuffer`1& keys, VBuffer`1& values, Output output)
   at Microsoft.ML.Data.SvmLightLoader.OutputMapper.Map(IntermediateOut intermediate, Output output)
   at Microsoft.ML.Transforms.CustomMappingTransformer`2.Mapper.<>c__DisplayClass5_0.<Microsoft.ML.Data.IRowMapper.CreateGetters>b__0()
   at Microsoft.ML.Transforms.CustomMappingTransformer`2.Mapper.<>c__DisplayClass6_0`1.<GetDstGetter>b__0(T& dst)
   at Microsoft.ML.Data.DataViewUtils.Splitter.InPipe.Impl`1.Fill()
   at Microsoft.ML.Data.DataViewUtils.Splitter.<>c__DisplayClass7_1.<ConsolidateCore>b__2()
   --- End of inner exception stack trace ---
   at Microsoft.ML.Data.DataViewUtils.Splitter.Batch.SetAll(OutPipe[] pipes)
   at Microsoft.ML.Data.DataViewUtils.Splitter.Cursor.MoveNextCore()
   at Microsoft.ML.Data.RootCursorBase.MoveNext()
   at Microsoft.ML.Data.SynchronizedCursorBase.MoveNext()
   at SVMLightLoaderTest.Program.PrintData(IDataView svmData) in /Users/justinormont/Projects/SVMLightLoaderTest/SVMLightLoaderTest/Program.cs:line 121
   at SVMLightLoaderTest.Program.Main() in /Users/justinormont/Projects/SVMLightLoaderTest/SVMLightLoaderTest/Program.cs:line 45

Points to:

private void MapCore(ref VBuffer<uint> keys, ref VBuffer<float> values, Output output)
{
    Contracts.Check(keys.Length == values.Length, "number of keys does not match number of values.");
    // Both of these inputs should be dense, but still work even if they're not.
    VBufferUtils.Densify(ref keys);
    VBufferUtils.Densify(ref values);
    var keysValues = keys.GetValues();
    var valuesValues = values.GetValues();
    // The output vector could be sparse, so we use BufferBuilder here.
    _bldr.Reset((int)_keyMax, false);
    _indexUsed.SetAll(false);
    for (int i = 0; i < keys.Length; ++i)
    {
        var key = keysValues[i];
        if (key == 0 || key > _keyMax)
            continue;
        if (_indexUsed[(int)key - 1])
            throw Contracts.Except("Duplicate keys found in dataset");
        _bldr.AddFeature((int)key - 1, valuesValues[i]);
        _indexUsed[(int)key - 1] = true;
    }
    _bldr.GetResult(ref output.Features);
}

Side note: It looks like Visual Studio on macOS is not loading the symbols (or source) for ML.NET.

Source code / logs

Repro:

Bug exists in ML.NET v1.5.0 to v1.5.4 (current). SvmLightLoader was added in v1.5.0.
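A minimal sketch of how a dense input with more than 128 rows could be generated and read back. This is not the original repro project. The class name `DenseSvmLightRepro`, the helper `MakeDenseRows`, the file name `dense.svm`, and the choice of 200 rows with 4 features are all hypothetical, and it assumes the `LoadFromSvmLightFile` extension on `mlContext.Data` that shipped with the loader in v1.5.0. Since the failure is dataset dependent, this particular data may or may not trigger it.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class DenseSvmLightRepro
{
    // Hypothetical helper: builds SVM Light lines in which every feature
    // index 1..features has an explicit, non-zero value (a fully dense row).
    public static string[] MakeDenseRows(int rows, int features) =>
        Enumerable.Range(0, rows)
            .Select(r => (r % 2) + " " + string.Join(" ",
                Enumerable.Range(1, features).Select(f => f + ":" + (r + f))))
            .ToArray();

    public static void Main()
    {
        // Assumed file name and sizes; 200 rows is simply "more than 128".
        const string path = "dense.svm";
        File.WriteAllLines(path, MakeDenseRows(rows: 200, features: 4));

        var mlContext = new MLContext(seed: 1);
        IDataView data = mlContext.Data.LoadFromSvmLightFile(path);

        // Read the Features column of every row so the loader has to parse each line.
        long count = 0;
        using (var cursor = data.GetRowCursor(data.Schema))
        {
            var getFeatures = cursor.GetGetter<VBuffer<float>>(data.Schema["Features"]);
            var features = default(VBuffer<float>);
            while (cursor.MoveNext())
            {
                getFeatures(ref features);
                count++;
            }
        }
        Console.WriteLine($"Read {count} rows");
    }
}
```

Lowering `rows` to a value below 128, or leaving most feature indices out of each line, should give the working cases described under Issue for comparison.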

Background

I was attempting to run AutoML.NET on an SVM Light dataset using the CLI. Because AutoML.NET lacks SVM Light support, I tried to convert the SVM Light file to a sparse TSV. The goal was to have AutoML.NET read the converted sparse TSV file, but the conversion failed.

Using MAML in v1.5.4: (fails)
dotnet ./bin/AnyCPU.Release/Microsoft.ML.Console/netcoreapp2.1/MML.dll SaveData data=Day0.svm loader=SvmLightLoader{} xf=SelectColumns{keep=Label keep=Features} saver=Text{schema=- dense=-} dout=Day0.tsv

This fails with the errors above, since the failure is in the current SvmLightLoader itself.

Using TLC's MAML: (works)
maml.exe SaveData data=Day0.svm loader=SvmLightLoader{} xf=KeepColumns{col=Label col=Features} saver=Text{schema=- dense=-} dout=Day0.tsv

The old internal version of ML.NET (TLC) correctly reads the SVM Light format and writes a TSV. This implies that a bug was introduced when we released SvmLightLoader with v1.5.0 of ML.NET.

Metadata

    Labels: P1 (Priority of the issue for triage purpose: Needs to be fixed soon.), bug (Something isn't working), loadsave (Bugs related loading and saving data or models)
