Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DatabaseLoader] Error when using attributes (i.e ColumnName) #4195

Closed
luisquintanilla opened this issue Sep 9, 2019 · 14 comments · Fixed by #4308

Comments

@luisquintanilla
Copy link

commented Sep 9, 2019

System information

  • OS version/distro: Windows 10 18362
  • .NET Version (eg., dotnet --info): .NET Core 2.2

Issue

  • What did you do?

Tried to use ColumnName attribute in class that defines IDataView schema.

  • What happened?

When operating on the IDataView, I received the following error

System.Reflection.TargetInvocationException: 'Exception has been thrown by the target of an invocation.'
 
Inner Exception
IndexOutOfRangeException: Features
  • What did you expect?

No error.

Source code / logs

Given the following data stored in a SQL Server DB

image

The data schema is defined as such

public class HouseData
{
    public float Size { get; set; }

    public float Price { get; set; }
}

The following code works:

string connectionString = @"Connection-String";

string sqlCommand = "SELECT Size,Price FROM House";

MLContext mlContext = new MLContext();

DatabaseLoader loader = mlContext.Data.CreateDatabaseLoader<HouseData>();

DatabaseSource dbSource = new DatabaseSource(SqlClientFactory.Instance,connectionString,sqlCommand);

IDataView data = loader.Load(dbSource);

// Test Code
IEnumerable<HouseData> housingData = mlContext.Data.CreateEnumerable<HouseData>(data, reuseRowObject: true);

foreach (HouseData house in housingData)
{
    Console.WriteLine($"{house.Size} costs {house.Price}");
}

Console.ReadKey();

However, when attributes are added to the schema class, it produces an error.

public class HouseData
{
    [ColumnName("Features")]
    public float Size { get; set; }

    [ColumnName("Label")]
    public float Price { get; set; }
}
@CESARDELATORRE

This comment has been minimized.

Copy link
Contributor

commented Sep 13, 2019

Agree. This needs to be fixed.
Basically, the point is, when using a file, the source of truth is the index in LoadColumnAttribute, whereas when using a database the source of truth should be the C# property name that has to match the database column. Then, in order to be consistent across both scenarios, the ColumnNameAttribute allows to change the IDataView schema and change any column name for instance from "HousePrice" to "Label".

I think it should be that way so we are consistent with how the ColumnNameAttribute works when using files:

https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.data.columnnameattribute?view=ml-dotnet
Allows a member to specify IDataView column name directly, as opposed to the default behavior of using the member name as the column name.

In the current public preview works kind of the opposite way. Such as here:

[ColumnName("fare_amount")]
public float FareAmount { get; set; }

If using the ColumnName attribute, what has to match to the database column name is [ColumnName("fare_amount")], then it you want you can change the C# property to a different name it you'd like. That approach is not crazy and sounds logical, too. In fact, you can also do that when using a file because "the source of truth is the index of the LoadColumnAttribute". But when using a database, we need to choose which one has to match the database column names, and in order to be consistent to how the ColumnNameAttribute works when using files I think it should be:

  • C# property has to match the database column name
  • The Schema column name can be changed with [ColumnName("my-changed-column-name")] or [ColumnName("label")], etc.
@tannergooding

This comment has been minimized.

Copy link
Member

commented Sep 17, 2019

I think it is more complicated than described above.

From what I can tell, in the file loader you have LoadColumn which is used to force the file column that is loaded and ColumnName which impacts IDataView.

In this case, the same logic is being applied between both and I believe you would be expected to set both ColumnName and LoadColumn if you wanted to achieve this in the file loader.

@tannergooding

This comment has been minimized.

Copy link
Member

commented Sep 23, 2019

@CESARDELATORRE, could you comment on the above?

I want to make sure we're on the same page

@CESARDELATORRE

This comment has been minimized.

Copy link
Contributor

commented Sep 23, 2019

The thing is that LoadColumn based on the index is not something devs using a database will do.
Usually you don't want devs to specify what columns to load from the table based on an 'index column' but simply on the property/attribute name of each in the class.

Then, in addition to that we want the user to be able to change the internal column's name in the schema.

@eerhardt What are your thoughts about this topic?

@tannergooding

This comment has been minimized.

Copy link
Member

commented Sep 23, 2019

I think that argues for having a LoadColumn which takes a name (which then corresponds to the database column of the same name), no?

@CESARDELATORRE

This comment has been minimized.

Copy link
Contributor

commented Sep 23, 2019

That could be interesting and more flexible, for sure. But the current implementation of LoadColumnAttribute only allows an index..:

https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.data.loadcolumnattribute?view=ml-dotnet

In any case, that could be added in the future since if the user wouldn't provide the LoadColumnAttribute then, by default, the property/field name in the class is the one to match against the database column name. Right?

@tannergooding

This comment has been minimized.

Copy link
Member

commented Sep 23, 2019

Right, which ends up matching I think. Basically right now:

No Attributes Specified

Property/Field name is used for loading the data and in the data view.

ColumnName Specified

Property/Field name is used for loading the data and ColumnName attribute for the DataView name

So far, this matches the TextLoader behavior (as far as I can tell). There is additionally LoadColumn which works with the specified database index, in which case:

LoadColumn Specified

LoadColumn attribute is used for loading the data and Property/Field name is used in the DataView

LoadColumn and ColumnName specified

LoadColumn attribute is used for loading the data and ColumnName attribute is used in the DataView. Property/Field name doesn't matter

This (also as far as I can tell) also matches up with the TextLoader behavior. So extending LoadColumn to support using a string, rather than just an index, would keep the consistency and provide something that "works better" for the database loader.

@eerhardt

This comment has been minimized.

Copy link
Member

commented Sep 23, 2019

https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.data.columnnameattribute?view=ml-dotnet
Allows a member to specify IDataView column name directly, as opposed to the default behavior of using the member name as the column name.

The Microsoft.ML.Data.ColumnNameAttribute is about specifying the name of the Column in the IDataView, i.e. what ML.NET sees as the Column name. It is doc'd exactly as that. I can totally understand any confusion between an IDataView "Column" and a database "Column". So we do need to make sure our behavior makes sense here.

C# property has to match the database column name

I don't think this can always be the case. For example, a SQL Table column name can have an @ symbol in it, but C# can't. You can also have spaces in a SQL column name.

select Id, hi@ as hi#, [my column] from [dbo].[Table]

image

that is totally valid. You can't always have a C# property/field name match a name coming from a database.

The thing is that LoadColumn based on the index is not something devs using a database will do.
Usually you don't want devs to specify what columns to load from the table based on an 'index column' but simply on the property/attribute name of each in the class.

  1. This is the index of the column in the DbDataReader, which comes from the order you specified the columns in the SELECT statement you gave in CommandText. This isn't always the same thing as the index of the column in the Table.
  2. Using LoadColumn is the only way you can load multiple columns into a single vector. So we are expecting this is something a dev will do.

I think that argues for having a LoadColumn which takes a name

I agree that we will need some way to map from the DbDataReader's name of the column, to the C# property, to the IDataView Column. All 3 of those things could potentially need to be named different things.

However, I don't think we should overload the LoadColumnAttribute for this. The main reason being that the TextLoader (the main consumer of LoadColumnAttribute) won't understand it. Instead, I think we should define a new attribute for this exact case - DatabaseColumnNameAttribute or similar.

@tannergooding

This comment has been minimized.

Copy link
Member

commented Sep 26, 2019

Thanks @eerhardt. I think that makes sense.

The remaining question on my end (before making changes and submitting a PR) is:

  • Does continuing to support LoadColumn for database loader types make sense or should it be kept as only for the text loader?
@eerhardt

This comment has been minimized.

Copy link
Member

commented Sep 27, 2019

Does continuing to support LoadColumn for database loader types make sense or should it be kept as only for the text loader?

This is the only way to support reading multiple columns into a single vector, right? And if a user does wish to indicate columns by index, this would be the way they do it. So I'm thinking we still need to support it.

@CESARDELATORRE

This comment has been minimized.

Copy link
Contributor

commented Sep 27, 2019

Agree, we still need it especially because to support reading multiple columns into a single vector.

@tannergooding

This comment has been minimized.

Copy link
Member

commented Sep 27, 2019

Agree, we still need it especially because to support reading multiple columns into a single vector.

Should we not be supporting doing this via the names? I would think either the index approach is sufficient (as it is for TextLoader, in which case the new attribute is just "nice to have") or it isn't and we should support using the names for all these things.

@tannergooding

This comment has been minimized.

Copy link
Member

commented Sep 27, 2019

I guess I'm also not seeing why the LoadColumn("name") couldn't also apply to the text loader scenario; considering they can also specify textual names for columns in the file.

@eerhardt

This comment has been minimized.

Copy link
Member

commented Sep 27, 2019

Should we not be supporting doing this via the names?

We could, but putting hundreds of names in the attribute seems less ideal than saying [LoadColumn(1, 300)]. So we still need to support index based here.

I would think either the index approach is sufficient ... or it isn't

I agree. My thinking is that the index approach is sufficient and mapping by names boarders on "nice to have". But it sounds like @CESARDELATORRE thinks it is more than just "nice to have".

I guess I'm also not seeing why the LoadColumn("name") couldn't also apply to the text loader scenario; considering they can also specify textual names for columns in the file.

I've asked for this in the past, and it was never implemented. I found some context here - #1515 (comment). You can probably find more by looking through the issues and PRs - #561 and #1878.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.