[AutoML] Auto detection of extra header rows mixed into the dataset #5051

I have a dataset in a text file with 20 columns: the 1st column is the class name (string), and the other columns are features (floats).

Here are the first lines of this file:
```
Class	A1	A2	A3	A4	A5	A6	A7	A8	A9	A10	A11	A12	A13	A14	A15	A16	A17	A18	A19
CS	61.00000	0.16855	0.00000	1.77778	3.00000	0.25375	0.07984	0.00169	0.02250	0.01535	0.07984	0.01027	0.27415	6.00000	4.00000	0.37649	3552.00000	0	26.00000
CS	316.00000	0.14823	15.00000	1.77778	10.00000	0.02352	0.00440	0.20407	0.00357	0.00914	0.03585	0.14171	0.01674	21.00000	4.00000	0.14961	4235.00000	0	17.00000
CS	176.00000	0.00000	20.00000	1.77778	3.00000	0.01850	0.19659	0.00469	0.03895	0.00000	0.19659	0.59670	0.19659	10.00000	5.00000	0.23767	3850.00000	0	24.00000
CS	133.00000	0.00000	4.00000	1.33333	3.00000	0.00049	0.01214	0.22827	0.18777	0.18778	0.12627	0.00915	0.18777	11.00000	7.00000	0.32619	1880.00000	0	16.00000
CS	140.00000	0.00000	14.00000	1.33333	1.00000	0.01787	0.02860	0.48472	0.02860	0.59853	0.02860	1.06538	0.02860	9.00000	7.00000	0.02860	1876.00000	0	142.00000
```
The full file is here: [data.txt](https://github.com/dotnet/machinelearning/files/3180842/data.txt)

Let's run AutoML:

> mlnet auto-train --task `multiclass-classification` --dataset "data.txt" --has-header --label-column-name `Class` --max-exploration-time 10

As a result, AutoML generates a `ModelInput.cs` file that starts like this:
```csharp
 public class ModelInput
    {
        [ColumnName("Class"), LoadColumn(0)]
        public string Class { get; set; }
        [ColumnName("A1"), LoadColumn(1)]
        public string A1 { get; set; }
        [ColumnName("A2"), LoadColumn(2)]
        public string A2 { get; set; }
        [ColumnName("A3"), LoadColumn(3)]
        public string A3 { get; set; }
```

All columns are recognized as `string` instead of `float` 😢
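The title of this issue points at header rows repeated inside the data file. If that is what is happening here (an assumption; the six lines shown above contain only one header), every column would hold at least one non-numeric value, which could explain the `string` inference. A minimal shell sketch to check for it, where `find_repeated_headers` is a hypothetical helper name and the check only catches lines that are identical to the first line:

```shell
# Print the line number of every later line that is identical to the
# first (header) line of the file given as $1.
find_repeated_headers() {
    awk 'NR == 1 { header = $0; next }   # remember the header line
         $0 == header { print NR }       # report exact repeats of it
    ' "$1"
}
```

Running `find_repeated_headers data.txt` prints nothing when the header occurs only once; any line numbers it prints are candidates for removal before training.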

As a result, the data pipeline is also incorrect (`OneHotEncoding` was applied to numeric columns):
```csharp
            var dataProcessPipeline = mlContext.Transforms.Conversion.MapValueToKey("Class", "Class")
                .Append(mlContext.Transforms.Categorical.OneHotEncoding(new[]
                {
                    new InputOutputColumnPair("A3", "A3"), new InputOutputColumnPair("A4", "A4"),
                    new InputOutputColumnPair("A5", "A5"), new InputOutputColumnPair("A14", "A14"),
                    new InputOutputColumnPair("A15", "A15"), new InputOutputColumnPair("A18", "A18")
                }))
                .Append(mlContext.Transforms.Categorical.OneHotHashEncoding(new[]
                {
                    new InputOutputColumnPair("A1", "A1"), new InputOutputColumnPair("A2", "A2"),
                    new InputOutputColumnPair("A6", "A6"), new InputOutputColumnPair("A17", "A17"),
                    new InputOutputColumnPair("A19", "A19")
                }))
                .Append(mlContext.Transforms.Concatenate("Features",
                    new[] {"A3", "A4", "A5", "A14", "A15", "A18", "A1", "A2", "A6", "A17", "A19"}))
                .Append(mlContext.Transforms.NormalizeMinMax("Features", "Features"))
                .AppendCacheCheckpoint(mlContext);
```

Why are all columns recognized as strings in this case?
Why was `OneHotHashEncoding` used for some columns instead of `OneHotEncoding`?
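
To look into the second question, it may help to compare how many distinct values each column has. This rests on a guess that the choice between the two encoders depends on column cardinality, which is not confirmed anywhere in this issue. `count_distinct_per_column` is a hypothetical helper, and the sketch assumes a tab-separated file whose first line is the only header:

```shell
# Print "<column name><TAB><number of distinct values>" for every column
# of the tab-separated file given as $1. The first line is treated as the header.
count_distinct_per_column() {
    awk -F '\t' '
        NR == 1 {
            # Remember the column names from the header line.
            for (i = 1; i <= NF; i++) name[i] = $i
            ncols = NF
            next
        }
        {
            # Count each (column, value) pair only the first time it is seen.
            for (i = 1; i <= NF; i++)
                if (!seen[i, $i]++) distinct[i]++
        }
        END {
            for (i = 1; i <= ncols; i++) printf "%s\t%d\n", name[i], distinct[i]
        }
    ' "$1"
}
```

Comparing the output of `count_distinct_per_column data.txt` for the columns that received `OneHotEncoding` (`A3`, `A4`, `A5`, `A14`, `A15`, `A18`) against those that received `OneHotHashEncoding` (`A1`, `A2`, `A6`, `A17`, `A19`) would show whether the two groups differ in cardinality.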
