Ignore a column on Training #2945

Closed
RussellKirkwood opened this Issue Mar 13, 2019 · 4 comments

RussellKirkwood commented Mar 13, 2019

System information

  • Win10
  • .net core 2.1

Can I tell Training to ignore columns in the training data?

I tried leaving them out of Features, expecting them to be ignored, but that does not appear to work: the prediction results are not good. If I totally remove those fields from the data, my predictions are really good.

I would like to ignore IncidentReportedByID, IncidentReportedMethod and ID.

public class MyData
{
    [LoadColumn(0)]
    public float State;

    [LoadColumn(1)]
    public float City;

    [LoadColumn(2)]
    public float IncidentType;

    [LoadColumn(3)]
    public float IncidentReportedByID;

    [LoadColumn(4)]
    public float IncidentReportedMethod;

    [LoadColumn(5)]
    public float Label;
}

My training looks like this:

// Only State, City and IncidentType are concatenated into the Features column.
var dataProcessPipeline = mlContext.Transforms.Concatenate(DefaultColumnNames.Features,
        nameof(MyData.State),
        nameof(MyData.City),
        nameof(MyData.IncidentType))
    .AppendCacheCheckpoint(mlContext);

var trainer = mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent(labelColumnName: DefaultColumnNames.Label, featureColumnName: DefaultColumnNames.Features);
var trainingPipeline = dataProcessPipeline.Append(trainer);

Ivanidzo4ka commented Mar 13, 2019

Here is a small experiment. (I'm using the master branch, so some names are slightly different, and I also need the MapValueToKey transform to convert the label into keys for multiclass.)

var mlContext = new MLContext(seed: 1);
int n = 1000;
var data = new List<MyData>();
Random rnd = new Random(1);
for (int i = 0; i < n; i++)
{
    data.Add(new MyData()
    {
        City = rnd.Next(),
        IncidentReportedByID = rnd.Next(),
        IncidentReportedMethod = rnd.Next(),
        IncidentType = rnd.Next(),
        State = rnd.Next(),
        Label = rnd.Next(0, 4)
    });
}

var dataView = mlContext.Data.LoadFromEnumerable(data);
var dataProcessPipeline = mlContext.Transforms.Concatenate("Features",
        nameof(MyData.State),
        nameof(MyData.City),
        nameof(MyData.IncidentType))
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Label"))
    .AppendCacheCheckpoint(mlContext);

var trainer = mlContext.MulticlassClassification.Trainers.Sdca(new Trainers.SdcaMulticlassClassificationTrainer.Options() { NumberOfThreads = 1});
var trainingPipeline = dataProcessPipeline.Append(trainer);
var model = trainingPipeline.Fit(dataView);
var scored = model.Transform(dataView);
var metrics = mlContext.MulticlassClassification.Evaluate(scored);

I get the following metrics:
[image: screenshot of the evaluation metrics]

Let me change the data generation code to the following:

data.Add(new MyData()
{
    City = rnd.Next(),
    IncidentReportedByID = rnd.Next(),
    IncidentReportedMethod = rnd.Next(),
    IncidentType = rnd.Next(),
    State = rnd.Next(),
    Label = rnd.Next(0, 4)
});
data[i].IncidentReportedByID = 0;
data[i].IncidentReportedMethod = 0;

Note the two added lines:
data[i].IncidentReportedByID = 0;
data[i].IncidentReportedMethod = 0;

I can put any values into these properties (float.PositiveInfinity, 0, whatever you want). If I run the same code again, I get the same metrics.
[image: screenshot of the evaluation metrics]

This makes me believe our code actually ignores them. (I also debugged the trainer and made sure it works with only 3 features in both cases, but that is harder to show.)
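One way to check the feature count without stepping through the trainer is to look at the schema of the transformed data. The sketch below is not taken from this thread. It assumes ML.NET 1.0 or later, where the vector type is named `VectorDataViewType`; the `FeatureCheck` class and its `FeatureVectorSize` method are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

public class MyData
{
    public float State;
    public float City;
    public float IncidentType;
    public float IncidentReportedByID;
    public float IncidentReportedMethod;
    public float Label;
}

// Hypothetical helper, for illustration only.
public static class FeatureCheck
{
    public static int FeatureVectorSize()
    {
        var mlContext = new MLContext(seed: 1);

        // A single row is enough: only the resulting schema is inspected.
        var data = new List<MyData>
        {
            new MyData
            {
                State = 1, City = 2, IncidentType = 3,
                IncidentReportedByID = 4, IncidentReportedMethod = 5, Label = 0
            }
        };
        IDataView dataView = mlContext.Data.LoadFromEnumerable(data);

        // Same Concatenate call as in the experiment: three source columns.
        var pipeline = mlContext.Transforms.Concatenate("Features",
            nameof(MyData.State),
            nameof(MyData.City),
            nameof(MyData.IncidentType));

        IDataView transformed = pipeline.Fit(dataView).Transform(dataView);

        // The Features column is a fixed-size vector; Size is its slot count.
        var featuresType = (VectorDataViewType)transformed.Schema["Features"].Type;
        return featuresType.Size;
    }

    public static void Main()
    {
        Console.WriteLine($"Features vector size: {FeatureVectorSize()}");
    }
}
```

If the reported size matches the number of columns passed to Concatenate, the other columns are still present in the data view but are not part of what the trainer reads from Features.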

RussellKirkwood commented Mar 14, 2019

singlis commented Mar 18, 2019

@RussellKirkwood I am closing this issue. If this is not resolved, please re-open.

singlis closed this Mar 18, 2019

RussellKirkwood commented Mar 18, 2019

Thanks, yes, it looks like it does ignore the columns.
