Getting OutOfMemory exception while training model on large datasets in file #3869

@prathyusha12345

System information

  • OS version/distro: Windows

Issue

I am creating a sample (dotnet/machinelearning-samples#520) that trains a model on a large dataset stored in a file, using a BinaryClassification trainer. During training, the Fit() call shown below throws an OutOfMemoryException.

var model = trainingPipeLine.Fit(trainTestData.TrainSet);

Complete details of the error:

System.FormatException
  HResult=0x80131537
  Message=Parsing failed with an exception: Stream reading encountered exception
  Source=Microsoft.ML.Data
  StackTrace:
   at Microsoft.ML.Data.TextLoader.Cursor.<ParseParallel>d__33.MoveNext()
   at Microsoft.ML.Data.TextLoader.Cursor.MoveNextCore()
   at Microsoft.ML.Data.RootCursorBase.MoveNext()
   at Microsoft.ML.Data.LinkedRowFilterCursorBase.MoveNextCore()
   at Microsoft.ML.Data.RootCursorBase.MoveNext()
   at Microsoft.ML.Transforms.ValueToKeyMappingTransformer.Train(IHostEnvironment env, IChannel ch, ColInfo[] infos, IDataView keyData, ColumnOptionsBase[] columns, IDataView trainingData, Boolean autoConvert)
   at Microsoft.ML.Transforms.ValueToKeyMappingTransformer..ctor(IHostEnvironment env, IDataView input, ColumnOptionsBase[] columns, IDataView keyData, Boolean autoConvert)
   at Microsoft.ML.Transforms.ValueToKeyMappingEstimator.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.Transforms.OneHotEncodingTransformer..ctor(ValueToKeyMappingEstimator term, IEstimator`1 toVector, IDataView input)
   at Microsoft.ML.Transforms.OneHotEncodingEstimator.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at LargeDatasetsInSqlServer.Program.Main() in C:\GitRepos\Fork\ML-samples\ML-samples-LargeDataInFile\samples\csharp\getting-started\LargeDatasetsInFile\LargeDatasetsInFile\Program.cs:line 107

Inner Exception 1:
FormatException: Stream reading encountered exception

Inner Exception 2:
OutOfMemoryException: Insufficient memory to continue the execution of the program.


The dataset was copied from the shared folder \ct01\data\Criteo\Spark\day_0_withHeader.tsv.

Source code / logs

The entire source code is available at https://github.com/prathyusha12345/machinelearning-samples/tree/LargeDatasetsInFile/samples/csharp/getting-started/LargeDatasetsInFile

Metadata

Labels: P0 (Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away.)
