System information
- OS version/distro: macOS 10.14.6
- .NET Version (eg., dotnet --info): 5.0.101
Issue
SvmLightLoader fails when you load more than 128 dense rows.
When the feature column's sparsity is > 0.25, the column is represented internally in sparse format; otherwise it is dense. SvmLightLoader works if the column is sparse (many missing values) or if the number of rows is < 128.
Error
Fails with one of three errors (dataset dependent):
- System.InvalidOperationException: Duplicate keys found in dataset
- System.ArgumentException: Destination is too short. (Parameter 'destination')
- System.IndexOutOfRangeException: Index was outside the bounds of the array.
Stack trace:
Unhandled exception. System.InvalidOperationException: Splitter/consolidator worker encountered exception while consuming source data
---> System.InvalidOperationException: Duplicate keys found in dataset
at Microsoft.ML.Data.SvmLightLoader.OutputMapper.MapCore(VBuffer`1& keys, VBuffer`1& values, Output output)
at Microsoft.ML.Data.SvmLightLoader.OutputMapper.Map(IntermediateOut intermediate, Output output)
at Microsoft.ML.Transforms.CustomMappingTransformer`2.Mapper.<>c__DisplayClass5_0.<Microsoft.ML.Data.IRowMapper.CreateGetters>b__0()
at Microsoft.ML.Transforms.CustomMappingTransformer`2.Mapper.<>c__DisplayClass6_0`1.<GetDstGetter>b__0(T& dst)
at Microsoft.ML.Data.DataViewUtils.Splitter.InPipe.Impl`1.Fill()
at Microsoft.ML.Data.DataViewUtils.Splitter.<>c__DisplayClass7_1.<ConsolidateCore>b__2()
--- End of inner exception stack trace ---
at Microsoft.ML.Data.DataViewUtils.Splitter.Batch.SetAll(OutPipe[] pipes)
at Microsoft.ML.Data.DataViewUtils.Splitter.Cursor.MoveNextCore()
at Microsoft.ML.Data.RootCursorBase.MoveNext()
at Microsoft.ML.Data.SynchronizedCursorBase.MoveNext()
at SVMLightLoaderTest.Program.PrintData(IDataView svmData) in /Users/justinormont/Projects/SVMLightLoaderTest/SVMLightLoaderTest/Program.cs:line 121
at SVMLightLoaderTest.Program.Main() in /Users/justinormont/Projects/SVMLightLoaderTest/SVMLightLoaderTest/Program.cs:line 45
Points to:
private void MapCore(ref VBuffer<uint> keys, ref VBuffer<float> values, Output output)
{
    Contracts.Check(keys.Length == values.Length, "number of keys does not match number of values.");

    // Both of these inputs should be dense, but still work even if they're not.
    VBufferUtils.Densify(ref keys);
    VBufferUtils.Densify(ref values);
    var keysValues = keys.GetValues();
    var valuesValues = values.GetValues();

    // The output vector could be sparse, so we use BufferBuilder here.
    _bldr.Reset((int)_keyMax, false);
    _indexUsed.SetAll(false);
    for (int i = 0; i < keys.Length; ++i)
    {
        var key = keysValues[i];
        if (key == 0 || key > _keyMax)
            continue;
        if (_indexUsed[(int)key - 1])
            throw Contracts.Except("Duplicate keys found in dataset");
        _bldr.AddFeature((int)key - 1, valuesValues[i]);
        _indexUsed[(int)key - 1] = true;
    }
    _bldr.GetResult(ref output.Features);
}
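To make the failing check easier to reason about, here is a minimal standalone sketch of the same key-mapping logic. It is not the ML.NET implementation: `MapRow` is a hypothetical stand-in that uses plain arrays instead of `VBuffer` and `BufferBuilder`, and it only illustrates which inputs reach the "Duplicate keys found in dataset" branch.

```csharp
using System;

// Hypothetical stand-in for OutputMapper.MapCore (plain arrays, no ML.NET types).
// Returns a dense feature array of length keyMax; throws on a repeated in-range key.
static float[] MapRow(uint[] keys, float[] values, uint keyMax)
{
    if (keys.Length != values.Length)
        throw new ArgumentException("number of keys does not match number of values.");

    var features = new float[(int)keyMax];
    var indexUsed = new bool[(int)keyMax];
    for (int i = 0; i < keys.Length; ++i)
    {
        uint key = keys[i];
        // Key 0 and keys beyond keyMax are silently skipped, as in the excerpt.
        if (key == 0 || key > keyMax)
            continue;
        int index = (int)key - 1; // keys are 1-based, feature slots are 0-based
        if (indexUsed[index])
            throw new InvalidOperationException("Duplicate keys found in dataset");
        features[index] = values[i];
        indexUsed[index] = true;
    }
    return features;
}

// Distinct keys map cleanly into slots 0 and 2.
var mapped = MapRow(new uint[] { 1, 3 }, new float[] { 0.5f, 2f }, 3);
Console.WriteLine(string.Join(", ", mapped));

// A repeated in-range key reaches the throwing branch.
try
{
    MapRow(new uint[] { 2, 2 }, new float[] { 1f, 1f }, 3);
}
catch (InvalidOperationException e)
{
    Console.WriteLine(e.Message);
}
```

In this sketch the exception can only come from a row whose key list repeats an in-range index, so if the input file has no repeated indices, the key buffer handed to the real `MapCore` would have to differ from what is on disk.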
Side note: It looks like Visual Studio on macOS is not loading the symbols (or source) for ML․NET.
Source code / logs
Repro:
Bug exists in ML․NET v1.5.0 to v1.5.4 (current). SvmLightLoader was added in v1.5.0.
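For anyone without the original dataset, a synthetic input matching the conditions described in the issue (fully dense rows, more than 128 of them) can be generated with a sketch like the one below. This is not the repro attached to the report: `MakeDenseSvmLight` and the file name `dense129.svm` are hypothetical, and whether this particular file triggers the failure has not been verified.

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Linq;

// Hypothetical helper: builds `rows` SVM Light lines in which every one of
// `features` columns carries a non-zero value, so the feature vector is dense.
static string[] MakeDenseSvmLight(int rows, int features)
{
    var lines = new string[rows];
    for (int r = 0; r < rows; ++r)
    {
        int label = r % 2 == 0 ? 1 : -1;
        // SVM Light feature indices are 1-based and listed in ascending order.
        var pairs = Enumerable.Range(1, features)
            .Select(k => k.ToString(CultureInfo.InvariantCulture) + ":" +
                         ((r + k) % 7 + 1).ToString(CultureInfo.InvariantCulture));
        lines[r] = label.ToString(CultureInfo.InvariantCulture) + " " + string.Join(" ", pairs);
    }
    return lines;
}

// 129 rows is just past the 128-row boundary described in the report.
File.WriteAllLines("dense129.svm", MakeDenseSvmLight(129, 10));
Console.WriteLine("wrote dense129.svm");
```

The generated file can then be passed as `data=` to the SaveData command in place of Day0.svm.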
Background
I was attempting to run AutoML․NET on an SVM Light dataset (download) using the CLI. Because AutoML․NET lacks SVM Light support, I tried to convert the SVM Light file to a sparse TSV. The goal was to have AutoML․NET read the converted sparse TSV file, but the conversion failed.
Using MAML in v1.5.4: (fails)
dotnet ./bin/AnyCPU.Release/Microsoft.ML.Console/netcoreapp2.1/MML.dll SaveData data=Day0.svm loader=SvmLightLoader{} xf=SelectColumns{keep=Label keep=Features} saver=Text{schema=- dense=-} dout=Day0.tsv
This fails with the errors above because the current SvmLightLoader cannot load the file.
Using TLC's MAML: (works)
maml.exe SaveData data=Day0.svm loader=SvmLightLoader{} xf=KeepColumns{col=Label col=Features} saver=Text{schema=- dense=-} dout=Day0.tsv
The old internal version of ML․NET (TLC) correctly reads the SVM Light format and writes a TSV. This implies that a bug was introduced when we released SvmLightLoader with v1.5.0 of ML․NET.
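Since TLC reads the same file without error, the data itself is presumably fine. One way to confirm this independently of either loader is to scan the file for rows that repeat a feature index. The sketch below is a hypothetical checker (the name `FindDuplicateKeyLines` is made up for illustration) and assumes the usual SVM Light layout of a label followed by `index:value` pairs and an optional trailing `#` comment.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Hypothetical checker: returns the 1-based line numbers whose feature indices repeat.
static List<int> FindDuplicateKeyLines(IEnumerable<string> lines)
{
    var bad = new List<int>();
    int lineNo = 0;
    foreach (var raw in lines)
    {
        ++lineNo;
        // Drop a trailing "# comment", which the SVM Light format allows.
        int hash = raw.IndexOf('#');
        var line = hash >= 0 ? raw.Substring(0, hash) : raw;
        var tokens = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        var seen = new HashSet<long>();
        // tokens[0] is the label; the remaining tokens are index:value pairs.
        for (int i = 1; i < tokens.Length; ++i)
        {
            int colon = tokens[i].IndexOf(':');
            if (colon <= 0)
                continue; // not an index:value pair
            var keyText = tokens[i].Substring(0, colon);
            // Non-numeric keys such as "qid" are ignored.
            if (!long.TryParse(keyText, NumberStyles.None, CultureInfo.InvariantCulture, out long key))
                continue;
            if (!seen.Add(key))
            {
                bad.Add(lineNo);
                break;
            }
        }
    }
    return bad;
}

// Check the file given on the command line, defaulting to the dataset named in the issue.
var path = args.Length > 0 ? args[0] : "Day0.svm";
if (File.Exists(path))
{
    var duplicates = FindDuplicateKeyLines(File.ReadLines(path));
    Console.WriteLine(duplicates.Count == 0
        ? "no duplicate feature indices"
        : "duplicate feature indices on lines: " + string.Join(", ", duplicates));
}
```

If this reports no duplicates for a file that still fails with "Duplicate keys found in dataset", the error message is not describing the data on disk.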
Source of the MapCore snippet above: machinelearning/src/Microsoft.ML.Transforms/SvmLight/SvmLightLoader.cs, lines 364 to 388 in 5dbfd8a