Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for loading data from SQL server #107

Closed
nemec opened this issue May 10, 2018 · 11 comments
Closed

Support for loading data from SQL server #107

nemec opened this issue May 10, 2018 · 11 comments
Labels
enhancement New feature or request

Comments

@nemec
Copy link

nemec commented May 10, 2018

Somewhat related to #96 but more specific, it would be nice to have a loader to stream data out of a SQL Server table or view and into this framework for training.

@justinormont justinormont added the enhancement New feature or request label May 10, 2018
@helloguo
Copy link

It seems SQL server is on the road map https://github.com/dotnet/machinelearning/blob/master/ROADMAP.md

Additional Data source support
Data from SQL Databases, such as SQL Server
Data located on the cloud

@forki
Copy link
Contributor

forki commented May 17, 2018

you want to bond the ML lib to a specific server? really?

@veikkoeeva
Copy link
Contributor

veikkoeeva commented May 17, 2018

@forki IDataView might work with arbitrary types, looking at #96 (comment). I'm not sure yet how these cursors work, but naturally having a provider for every library and source type is untenable. There might be support for ingesting data from sources like IEnumerable<T>, which would support any source quite easily. At least coming?

I'm not sure if this is the same for everyone, but for me it's a bit puzzling to see there are specific loaders for some sources, like files with CSV data, which gives the impression the idea is to bond data sources (as in providers and format of the data) quite closely to the library. Then there is discussion about these specific loaders and in-memory loaders. This suggests the idea indeed is to couple source loading and format handling together and push to a specifc pipeline. For instance, it looks to me the CSV loader input/output doesn't work in all locales and it's not easy to work around that. I also might find more performant and robust ways to load the data in any case.

@nemec
Copy link
Author

nemec commented May 17, 2018

@forki I am still having trouble understanding how the internals of this library works, but I don't believe it is controversial to suggest that there are more efficient/performant ways of training on data from a database than plugging in EF Core, transforming to IEnumerable (if the feature lands), then transforming again into an IDataView - or worse, exporting GBs/TBs of data to CSV just to load them back into this library.

If this fits better as an independent provider plugged in via NuGet, so be it. It would be nice to have at least one example of a SQL loader that OSS devs could use as a template to develop their own providers for SQLite, Mongo, etc. I would not expect this library to bundle every provider under the sun or carry a lot of potentially unused dependencies (e.g. EF Core).

@forki
Copy link
Contributor

forki commented May 18, 2018

I did not say EF ;-) - I heard some of the authors of this package worked on EF before, but please please don't bind it to EF, or sql server or any other specific tech. Or course you want to get data into the learning system without going over to csv. But the data is usually already flattened anyway, so no need for a complex OR mapping tool - especially since the target data structure is Immutable tensor

@Zruty0
Copy link
Contributor

Zruty0 commented Oct 10, 2018

Just bumping this.

Since SQL data source is such a common ask, we will implement a version of the SQL-backed data view (and a SQL-backed IDataReader). From what I understand, we'll probably use the primitives in System.Data for this, and, therefore, it'll be available as a separate NuGet (so that ML.NET core doesn't depend on System.Data.

Then again, nothing prevents anyone from implementing IDataView with any backing storage: NoSQL databases, magnetic tape readers or anything :)

@veikkoeeva
Copy link
Contributor

Referencing #1130 as is abouth the same issue. @Zruty0, just a note that there are inconsistencies between what various ADO.NET providers support from System.Data. asynchrony etc. and you might want to consider some more advanced things like streaming (not always properly supported). The linked issue has a link to one implementation that solves most of the problems if you want to see concrete code (other libraries have the same).

@Ivanidzo4ka
Copy link
Contributor

@singlis I think you looked into loading data from Sql Server, what was result of your investigation?

@singlis
Copy link
Member

singlis commented Mar 29, 2019

Hello, sorry for the late reply.

We currently integrate with IEnumerable and there is no binding to any specific database technology.

public IDataView LoadFromEnumerable<TRow>(IEnumerable<TRow> data, SchemaDefinition schemaDefinition = null)

There is work in progress sample in the machine-learning samples. Note the sample does use EF:
dotnet/machinelearning-samples#295

I have not done any investigation yet into how this works with large datasets. This is something that we want to get an idea for what EF does -- but the end result will be fine tuning EF because at the end of the day ML.NET is working from an IEnumerable. How the data is pulled from the database and cached is a behavior for Entity Framework.

@singlis
Copy link
Member

singlis commented Mar 29, 2019

I hope this helps -- since SQL support is on the road map, I am closing this issue. If there is more to discuss, please re-open.

@singlis singlis closed this as completed Mar 29, 2019
@jwood803
Copy link
Contributor

Hopefully this isn't in bad taste, but I have a written a small post on getting data in Azure SQL and then using that to create a model. Just in case an example without EF is wanted.

Basically, what @singlis mentioned, that the LoadFromEnumerable is the magic that is needed for SQL data.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 31, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

9 participants