CookBook sample - Could not find input column 'CategoricalOneHot' #2485

Closed
sdg002 opened this Issue Feb 9, 2019 · 5 comments

sdg002 commented Feb 9, 2019

System information

  • OS version/distro: Windows 10
  • .NET Version (e.g., dotnet --info): 4.6

Issue

Source code / logs

        // Build several alternative featurization pipelines.
        var pipeline =
            // Convert each categorical feature into one-hot encoding independently.
            mlContext.Transforms.Categorical.OneHotEncoding("CategoricalFeatures", "CategoricalOneHot")
            // Convert all categorical features into indices, and build a 'word bag' of these.
            .Append(mlContext.Transforms.Categorical.OneHotEncoding("CategoricalFeatures", "CategoricalBag", Microsoft.ML.Transforms.Categorical.OneHotEncodingTransformer.OutputKind.Bag))
            // One-hot encode the workclass column, then drop all the categories that have fewer than 10 instances in the train set.
            .Append(mlContext.Transforms.Categorical.OneHotEncoding("Workclass", "WorkclassOneHot"))
            .Append(mlContext.Transforms.FeatureSelection.SelectFeaturesBasedOnCount("WorkclassOneHot", "WorkclassOneHotTrimmed", count: 10));

image

sfilipi commented Feb 10, 2019

Hi @sdg002,
thanks for reporting.

I cannot see the schema of your data, but your first pipeline estimator, the OneHotEncoding, expects the schema to have a column called "CategoricalOneHot". Is there such a column in your dataView object?

Can you do dataView.Preview() before starting the pipeline, and inspect the names of the columns?
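A minimal sketch of that check, for readers who want to list the column names in code rather than in the debugger. It assumes the ML.NET 1.x API (MLContext.Data.LoadFromEnumerable and IDataView.Schema), which differs from the preview version used in this thread; the Row class and its values are hypothetical stand-ins for the real data file.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

public static class SchemaInspection
{
    // Hypothetical row type standing in for two of the Adult dataset columns.
    public class Row
    {
        public bool Label { get; set; }
        public string Workclass { get; set; }
    }

    // Returns the names of all columns the data view currently exposes.
    public static string[] ColumnNames(IDataView dataView) =>
        dataView.Schema.Select(column => column.Name).ToArray();

    // Builds a tiny in-memory data view (assumption: ML.NET 1.x loader API).
    public static IDataView BuildSampleData(MLContext mlContext) =>
        mlContext.Data.LoadFromEnumerable(new[]
        {
            new Row { Label = true, Workclass = "Private" },
            new Row { Label = false, Workclass = "State-gov" },
        });

    public static void Main()
    {
        var mlContext = new MLContext();
        var dataView = BuildSampleData(mlContext);

        // Any column an estimator takes as *input* has to appear in this list.
        Console.WriteLine(string.Join(", ", ColumnNames(dataView)));
    }
}
```

If the name passed as the input column of the first estimator is missing from this list, fitting the pipeline can be expected to fail with a "Could not find input column" error like the one reported above.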

sfilipi added the need info label Feb 10, 2019

sdg002 commented Feb 10, 2019

Hi @sfilipi ,

Thanks for looking into this. Here is the code snippet from the CookBook that I have been trying to run; it fails with the error: Could not find input column 'CategoricalOneHot'.

Please note that I had to make a couple of lines of changes to the code snippet in the CookBook to get it to compile. (See below)

Link to CookBook

https://github.com/dotnet/machinelearning/blob/master/docs/code/MlNetCookBook.md#how-do-i-train-my-model-on-categorical-data
I hope I am following the correct link?

Code from CookBook

    [TestMethod]
    public void CookBook_CategoricalData()
    {
        var mlContext = new Microsoft.ML.MLContext();
        //Changed CreateTextReader to CreateTextLoader
        var reader = mlContext.Data.CreateTextLoader(new[]
            {
                new TextLoader.Column("Label", DataKind.BL, 0),
                // We will load all the categorical features into one vector column of size 8.
                new TextLoader.Column("CategoricalFeatures", DataKind.TX, 1, 8),
                // Similarly, load all numerical features into one vector of size 6.
                new TextLoader.Column("NumericalFeatures", DataKind.R4, 9, 14),
                // Let's also separately load the 'Workclass' column.
                new TextLoader.Column("Workclass", DataKind.TX, 1),
            },
            hasHeader: true
        );
        // Read the data.
        var data = reader.Read(_dataPath);


        // Inspect the first 10 records of the categorical columns to check that they are correctly read.
        var catColumns = data.GetColumn<string[]>(mlContext, "CategoricalFeatures").Take(10).ToArray();

        // Build several alternative featurization pipelines.
        var pipeline =
            // Convert each categorical feature into one-hot encoding independently.
            mlContext.Transforms.Categorical.OneHotEncoding("CategoricalFeatures", "CategoricalOneHot")
            // Convert all categorical features into indices, and build a 'word bag' of these.
            //.Append(mlContext.Transforms.Categorical.OneHotEncoding("CategoricalFeatures", "CategoricalBag", CategoricalTransform.OutputKind.Bag)) //This line did not compile
            .Append(mlContext.Transforms.Categorical.OneHotEncoding("CategoricalFeatures", "CategoricalBag", Microsoft.ML.Transforms.Categorical.OneHotEncodingTransformer.OutputKind.Bag))
            // One-hot encode the workclass column, then drop all the categories that have fewer than 10 instances in the train set.
            .Append(mlContext.Transforms.Categorical.OneHotEncoding("Workclass", "WorkclassOneHot"))
            //.Append(mlContext.Transforms.FeatureSelection.CountFeatureSelectingEstimator("WorkclassOneHot", "WorkclassOneHotTrimmed", count: 10)); //Did not compile
            .Append(mlContext.Transforms.FeatureSelection.SelectFeaturesBasedOnCount("WorkclassOneHot", "WorkclassOneHotTrimmed", count: 10));

        // Let's train our pipeline, and then apply it to the same data.
        var transformedData = pipeline.Fit(data).Transform(data);

        // Inspect some columns of the resulting dataset.
        var categoricalBags = transformedData.GetColumn<float[]>(mlContext, "CategoricalBag").Take(10).ToArray();
        var workclasses = transformedData.GetColumn<float[]>(mlContext, "WorkclassOneHotTrimmed").Take(10).ToArray();

    }

Sample CSV from Adult dataset

image

Changes I had to make to the code snippet

image

sfilipi commented Feb 11, 2019

Hi @sdg002 ,

In version 0.9 we swapped the order of the columns for the estimators, and it looks like we missed updating this snippet of the cookbook.

The order of the arguments is OutputColumn, InputColumn, so the column that appears second, or last among the estimator's column parameters, must already exist in the data view.

If you swap the order, the estimator reads the existing "CategoricalFeatures" column and writes "CategoricalOneHot":
mlContext.Transforms.Categorical.OneHotEncoding("CategoricalOneHot", "CategoricalFeatures")

See the test with that same pipeline here:
https://github.com/dotnet/machinelearning/blob/master/test/Microsoft.ML.Tests/Scenarios/Api/CookbookSamples/CookbookSamplesDynamicApi.cs#L371
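As an illustration of the output-first, input-second order, here is a small self-contained sketch. It assumes the ML.NET 1.x API, where the parameters are named outputColumnName and inputColumnName; the names in the preview version discussed in this thread may differ. The Row class and sample values are hypothetical.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

public static class OneHotArgumentOrder
{
    // Hypothetical row type with a single categorical column.
    public class Row
    {
        public string Workclass { get; set; }
    }

    // Fits a one-hot encoding and returns the column names of the transformed data.
    public static string[] FitAndListColumns()
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new Row { Workclass = "Private" },
            new Row { Workclass = "State-gov" },
            new Row { Workclass = "Private" },
        });

        // Named arguments make the direction explicit: the input column must
        // already exist in the data, the output column is created by the transform.
        var pipeline = mlContext.Transforms.Categorical.OneHotEncoding(
            outputColumnName: "WorkclassOneHot",
            inputColumnName: "Workclass");

        var transformed = pipeline.Fit(data).Transform(data);
        return transformed.Schema.Select(column => column.Name).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", FitAndListColumns()));
    }
}
```

Passing the column names as named arguments is a design choice rather than a requirement; it keeps the call readable and guards against a positional mix-up like the one in the cookbook snippet.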

Note: leaving this issue open, and we'll close it with the PR that fixes the cookbook.

sdg002 commented Feb 11, 2019

Yes. It worked after changing the order. Thanks for looking into this.

sfilipi commented Feb 11, 2019

Resolved with #2494

sfilipi closed this Feb 11, 2019