Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Early Draft specs doc for DatabaseLoader in ML.NET #3857

Merged
merged 8 commits into from Jul 17, 2019
Merged

Early Draft specs doc for DatabaseLoader in ML.NET #3857

merged 8 commits into from Jul 17, 2019

Conversation

CESARDELATORRE
Copy link
Contributor

@CESARDELATORRE CESARDELATORRE commented Jun 12, 2019

Early Draft specs doc for a DatabaseLoader component in ML.NET.
Feel free to provide feedback. This specs document will be evolving significantly since it is in a very early draft state.

Main objective is to load data into an IDataView from relational databases (such as SQL Server, Azure SQL Database, PostgreSQL, MySQL, Oracle, etc.) in a very easy way, one line of code in most cases by simply specifying the connection string, the database table/view/sql-sentence and what database provider/connection type it is using.

Currently in ML.NET 1.0 or 1.1 we can only do the following:

  • Load data from files by using MLContext.Data.LoadFromTextFile(), etc.
  • Load data from an IEnumerable collection with MLContext.Data.LoadFromEnumerable(), usually coming from a database, but the developer/user is responsible for how to load/query the database. It is not straightforward such as when reading from a file.

High level design goal:
The intention of the new component is to make a lot easier to implement 'model training scenarios pulling/streaming data from large database tables' while transparently solving complicated cases such as 'transient errors' in the cloud (database connections broken as a result) when using Azure SQL Database.

Copy link
Member

@markusweimer markusweimer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! I did a pass and left some questions.

Copy link
Contributor

@veikkoeeva veikkoeeva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some observations. Mostly that one could build so that any database with an ADO.NET provider works and that it's possible for the library code to take a dependency only to the factory methods and code, such as DbCommand, DbConnection and so on. Also pointers to concrete, functioning code that does this and seem provide the desired interface of var result = await db.QueryAsync<TResult>(SELECT * FROM Something);.

Also, maybe this way building the retries could be the problem of external libraries such as Polly or whatever one wants to use. Or provide oneself a retry function (say, a sigmoid shaped one).

- `NpgsqlConnection` (PostgreSQL)
- `OracleConnection` (Oracle)
- `MySqlConnection` (MySql)
- etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could just note DbConnection and note you are compatible with any database that has a ADO.NET provider, I think. It is https://github.com/dotnet/orleans/blob/d9a22985c6b066837c7e1383bf5b1c672f560f4a/src/AdoNet/Shared/Storage/RelationalStorage.cs#L255 much code to do plus some odd pieces to load the connector library and do like var result = await storage.ExecuteQueryAsync<TResult>(SELECT * FROM xyz);.

You can see an example of ADO.NET at https://blogs.msdn.microsoft.com/adonet/2012/07/15/using-sqldatareaders-new-async-methods-in-net-4-5-part-2-examples/ that streams photos (that might be a case for this framework).

- `MySqlConnection` (MySql)
- etc.

The specific type (`SqlConnection`, `NpgsqlConnection`, `OracleConnection`, etc.) is specified as a generic parameter so references to the needed NuGet packages will be provided by the user when trying to compile.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add another comment to reiterate do not need to take a dependency on these concrete types in the libraries. Just in case. Only connection string is needed and some other piece of knowledge which connector library to load if the user doesn't want to provide it (you probably want to have the loading mechanism to be the same nevertheless).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those types would be specified in the user's code, not in ML.NET code. Our parameters are Interfaces or base classes.
We prefer the users to need to provide a type (such as SqlConnection, NpgsqlConnection, OracleConnection) in their code so at that time they'll need to provide the dependency to their specific chosen database provider, that's the only reason we're specifying to provide a DbConnection type instead of a string with the provider such as "SQLServerProvider" or "OracleProvider", but then they'll need to know what NuGet assembly to add, etc. If not provided properly they'll get an execution exception. Instead, by providing a type (such as NpgsqlConnection) it won't compile until you add the needed NuGet package.

MLContext mlContext = new MLContext();

string conString = YOUR_CONNECTION_STRING;
using (SqlConnection connection = new SqlConnection(conString))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to point out that if you don't want to take a dependency on a specific ADO.NET library, you can use the factory functions and wrap it like at https://github.com/dotnet/orleans/blob/d9a22985c6b066837c7e1383bf5b1c672f560f4a/src/AdoNet/Shared/Storage/RelationalStorage.cs#L26. It looks to me that .LoadFromDbRead looks like https://github.com/dotnet/orleans/blob/d9a22985c6b066837c7e1383bf5b1c672f560f4a/src/AdoNet/Shared/Storage/RelationalStorageExtensions.cs#L186, which would reduce this to new RelationalStorage and then using it to do the queries like var result = await storage.ReadAsync(Select * from InputMLModelDataset). There are test for timeout, streaming etc. too if you want to take a look. I don't say you need to do it like this, but it's close to minimum code you can do and the interface is what noted here. It can also support IAsyncEnumerable trivially.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very interesting. 👍 Not sure if we want to go that path, but we'll evaluate it, for sure.
It might be interesting for our internal implementation code so we wouldn't need to take direct dependency on specific ADO.NET providers (as opposed to user's code where it is safer to take dependency to catch errors in compilation time). Thanks for these pointers to Orleans project! 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CESARDELATORRE No problem. To be clear, I don't mean you need to use that code or do something similar. Maybe more like give ideas how it could look like for what I think is a fairly minimal way to create something that produces at the end strongly typed classes while supporting all the features the various databases can have.

I wrote that part of the code so I'm somewhat familiar with the problems. For instance, some drivers (or their versions) have bugs that they are not necessarily asynchronous. Or they just don't support streaming without some extra work. Maybe the largest thing here is that I indeed wouldn't like to see this tied to any specific ORM since what would happen is that I'd pipe plain SQL through it and use for serializing data and likely end up with some inefficiencies along the way (multiple resultsets?). I don't know if people want to bring their own ORMs or just write efficient SQL that likely involves analytic/windowing functions too.

That "efficient SQL" covers also extensions like https://www.pipelinedb.com/ (PostgreSQL) or PolyBase etc.

Copy link
Contributor

@endintiers endintiers Jul 3, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @CESARDELATORRE ,

I think that as a pre-requisite for this, we should fix IDataView LoadFromEnumerable so it works for all Enumerables. Developers can easily make Enumerables from almost anything - including any database.

Then we can create a set of helpers to create Enumerables from abstract 'Databases' (with concrete implementations). This sounds like the kind of problem that has been solved many times before? Once this is implemented then this can just feed the resulting Enumerables to LoadFromEnumerable... This is how AutoML would feed in training and test data (from databases or any kind of structured store?).

LoadFromEnumerable doesn't always work at the moment because code further on assumes that the Enumerable-backed IDataView is thread-safe - this is not always the case (ex. EF Core Context). See: https://github.com/endintiers/Unearth.Demo.ML.FromDB

I understand why limiting concurrency for trainers was a bad idea. We shouldn't. In the cases when training from huge amounts of data (in a DB for example) it is slooowww.

So I propose starting by adding a switch to LoadFromEnumerable 'singleThreaded=true' which would intercept IDataView cache update requests and force them onto a single thread. This may slow down the trainers a bit but nowhere near as much as single-threading them (where possible) and we can improve performance with eager loading.

Doing that first would remove a lot of complexity from the DatabaseLoader task (threading no longer an issue).

  • Chris

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should get the initial draft in, as we are starting the work on this feature. We can iterate on it as we learn more doing the implementation.

Copy link
Contributor

@endintiers endintiers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's enough for me to get started (the objective is correct, some interesting thinking points). Bit cheeky but I'll be a heavy consumer. In any case: "code is truth" lets see what v0.1 looks like!

@veikkoeeva
Copy link
Contributor

veikkoeeva commented Jul 16, 2019

I add to the comments about some specific pain points I have had and seen others have:

  1. Have a quick and easy way to get access to database.
  2. Being able to write complex queries like row pattern recognition or graph/JSON.
  3. Being able to insert data efficiently to the database. I mean something like here or OPEN ROWSET or something alike.
  4. Ability to stream large blobs.

So something like

using(var db = new Database("connectionstring"))
{
    var res = db.QueryAsync<List<dynamic>>("SELECT * FROM Table1; SELECT * FROM Table2;");
}

Instead of using dynamic could be a class too (since there's multiple result sets, one might want to provide a Func to construct the objects (see the already linked pieces)). This way there's a quick way without too much scafolding and thinking to get working with a database and one can transport the queries across the tools.

In that sense a small core and a set of extension methods seem to cut the problem and maybe lead one to the idea of extending the core (like having a heuristic on how to bulk load data). Not everyone likes composing with functions and extensions, however.

@CESARDELATORRE CESARDELATORRE merged commit fb80f72 into dotnet:master Jul 17, 2019
Dmitry-A pushed a commit to Dmitry-A/machinelearning that referenced this pull request Jul 24, 2019
* Early Draft specs doc for DatabaseLoader in ML.NET

* Specifying implementaion as .NET Standard 2.0

* Support connect from within the RDBMS server

* Updated changes related to embeded connection

* Updates on Azure SQL DB support

* Minor updates on SQL CLR

* Minor typo fixed

* Refactored some code naming
@dotnet dotnet locked as resolved and limited conversation to collaborators Mar 21, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants