[AutoML] Auto detection of extra header rows mixed into the dataset #5051

@sergey-tihon

I have a dataset in a text file with 20 columns: the first column is the class name (string), and the other columns are features (floats).

Here are the first lines of this file:

Class	A1	A2	A3	A4	A5	A6	A7	A8	A9	A10	A11	A12	A13	A14	A15	A16	A17	A18	A19
CS	61.00000	0.16855	0.00000	1.77778	3.00000	0.25375	0.07984	0.00169	0.02250	0.01535	0.07984	0.01027	0.27415	6.00000	4.00000	0.37649	3552.00000	0	26.00000
CS	316.00000	0.14823	15.00000	1.77778	10.00000	0.02352	0.00440	0.20407	0.00357	0.00914	0.03585	0.14171	0.01674	21.00000	4.00000	0.14961	4235.00000	0	17.00000
CS	176.00000	0.00000	20.00000	1.77778	3.00000	0.01850	0.19659	0.00469	0.03895	0.00000	0.19659	0.59670	0.19659	10.00000	5.00000	0.23767	3850.00000	0	24.00000
CS	133.00000	0.00000	4.00000	1.33333	3.00000	0.00049	0.01214	0.22827	0.18777	0.18778	0.12627	0.00915	0.18777	11.00000	7.00000	0.32619	1880.00000	0	16.00000
CS	140.00000	0.00000	14.00000	1.33333	1.00000	0.01787	0.02860	0.48472	0.02860	0.59853	0.02860	1.06538	0.02860	9.00000	7.00000	0.02860	1876.00000	0	142.00000

The full file is data.txt.

Let's run AutoML:

mlnet auto-train --task multiclass-classification --dataset "data.txt" --has-header --label-column-name Class --max-exploration-time 10

As a result, AutoML generates a ModelInput.cs file that starts like this:

    public class ModelInput
    {
        [ColumnName("Class"), LoadColumn(0)]
        public string Class { get; set; }
        [ColumnName("A1"), LoadColumn(1)]
        public string A1 { get; set; }
        [ColumnName("A2"), LoadColumn(2)]
        public string A2 { get; set; }
        [ColumnName("A3"), LoadColumn(3)]
        public string A3 { get; set; }

All columns are recognized as string instead of float 😢

As a result, the data pipeline is also incorrect (OneHotEncoding was applied to numeric columns):

            var dataProcessPipeline = mlContext.Transforms.Conversion.MapValueToKey("Class", "Class")
                .Append(mlContext.Transforms.Categorical.OneHotEncoding(new[]
                {
                    new InputOutputColumnPair("A3", "A3"), new InputOutputColumnPair("A4", "A4"),
                    new InputOutputColumnPair("A5", "A5"), new InputOutputColumnPair("A14", "A14"),
                    new InputOutputColumnPair("A15", "A15"), new InputOutputColumnPair("A18", "A18")
                }))
                .Append(mlContext.Transforms.Categorical.OneHotHashEncoding(new[]
                {
                    new InputOutputColumnPair("A1", "A1"), new InputOutputColumnPair("A2", "A2"),
                    new InputOutputColumnPair("A6", "A6"), new InputOutputColumnPair("A17", "A17"),
                    new InputOutputColumnPair("A19", "A19")
                }))
                .Append(mlContext.Transforms.Concatenate("Features",
                    new[] {"A3", "A4", "A5", "A14", "A15", "A18", "A1", "A2", "A6", "A17", "A19"}))
                .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"))
                .AppendCacheCheckpoint(mlContext);

Why are all columns recognized as strings in this case?
Why was OneHotHashEncoding used for some columns instead of OneHotEncoding?
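
One possible cause, going by the issue title, is that the header row appears again somewhere inside the data: a single non-numeric value in a column would plausibly push type inference to string for that column. The following is a minimal sketch for checking this before running `mlnet`. The function name `find_repeated_headers` is hypothetical, and the sketch assumes a repeated header is an exact copy of line 1 (ignoring a trailing carriage return).

```shell
#!/bin/sh
# Hypothetical helper: print the line numbers of rows that repeat the header.
# Assumption: a repeated header is an exact copy of line 1.
find_repeated_headers() {
    # Strip a trailing CR so files with Windows line endings compare equal,
    # remember line 1 as the header, then report every later identical line.
    awk '{ sub(/\r$/, "") }
         NR == 1 { header = $0; next }
         $0 == header { print NR }' "$1"
}
```

Calling `find_repeated_headers data.txt` prints one line number per repeated header and prints nothing if the header occurs only once.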

Labels: P2 (priority of the issue for triage purposes: needs to be fixed at some point), enhancement (new feature or request)
