
Suggestion: Switching to FluentMigrator and Linq2Db for backup SQLite database operations #4389

Open
SomebodyOdd opened this issue Dec 11, 2020 · 7 comments

Comments

@SomebodyOdd

SomebodyOdd commented Dec 11, 2020

  • I have searched open and closed issues for duplicates.

Hello there!

I'd like to ask everyone involved in the project for their opinion.
IMO, the code that deals with local DB operations is less than ideal with respect to maintainability and overall readability. Those huge string literals with SQLite commands and the low-level DB class usage are something that could be improved on. How about switching to something like FluentMigrator and Linq2Db? (The specific tools to use are up for debate, of course; I used these as an example.)

I'd like to at least try to implement this, but I want to hear some feedback first.
Will it sit right with the core dev team? Is this something that could make dealing with duplicati code easier for everyone, and not just myself?

This switch could make optimizing backup operations easier and could also make integration with a server DBMS possible (although that is not the final goal of this proposal). Not to mention that it won't hurt a possible future migration to .NET 5, if that ever arises.

So, what do you think?

@SomebodyOdd
Author

Looks like no one is interested in this. Closing the issue; please reopen if I'm wrong and there is some feedback.

@ts678
Collaborator

ts678 commented Dec 15, 2020

Looks like no one is interested

Maybe nobody is familiar with the tools or the philosophy of use. I would be interested in your input on your question:

Is this something that could make dealing with duplicati code easier for everyone

because Duplicati very much needs DB volunteers. I'm not one, but would like to minimize the barriers to new entrants.
This could include an occasional contributor who shows up with something, but core team is also not that good at SQL.

These tools (to the extent I could follow them) seemed like they reduced "low level" code but needed high level training.
There are a lot of training resources for generic SQL. I'm not sure about the resources or learning curve for higher tools.

From a non-DB-person view, I find that current code is hard to read and follow. How safely can any migration be done?
There would not be a team of experts available to explain every feature of current code, yet migrated code should work.

From a maintainability point of view, would migrated code be easier to read and follow by those who aren't DB-expert?
Possibly you could post a sample of migrated code, so developers could see whether it seems an improvement to them.

This switch could make optimizing backup operations easier

Duplicati does have some SQL code that should go faster. Maybe the DB design is off, maybe the queries are off. Any help there?
The help short-term would probably be in writing equivalent-but-faster SQL. Big schema redesigns might be too much.

There might also be some actual bugs in logic (either C# or the SQL it runs). Tracking them down in field is not possible.
Tracking them down in reproducible lab conditions can involve lots of digging through logs of SQL use. Any help there?

There are currently some cases where crashes (or kills) leave DB damaged. Rollback design is not understood. Any help?

FluentMigrator

Switching to what I guess is schema migration, there is currently a version-based DatabaseUpgrader that upgrades the
databases as needed (which means a system with multiple backup jobs defined may have databases at different levels).

Change increments are under source control. Downgrades are tough, e.g. if older Duplicati is installed. Any help there?
This reversibility does seem somewhat in line with what tools can do, but I'm not sure what new code they will require.
Learning curve question still applies to both forward and backward moves. Currently there is no backward code, AFAIK.

My current view is that most people are hesitant to change the Duplicati DB code for fear of breaking it.
If you're someone who can go in with some confidence (despite fewer tools and a tough code read), you're very welcome.
I can explain some of the general idea (also loosely diagrammed here), but not the details of why the built SQL is as it is.

@SomebodyOdd
Author

SomebodyOdd commented Dec 15, 2020

Okay, let me start with a big thank you for your pretty detailed answer, @ts678; it covers a lot of what I was curious about regarding the core dev team.
And also sorry for any possible grammar mistakes, English is not my native language.

I'll set a scene first.

  1. I'm not suggesting a schema redesign. With the current tools and code, that might take a big effort and many code changes, so I'm assuming that the current schema is good enough and won't be changed.
    My point is not redesigning the schema itself, but the way that this schema is managed and accessed.

  2. I'm not suggesting complete architectural changes in duplicati. That is, the database after said improvements will look really close to what it was before and mean the same thing. The same concepts of blocks and volumes will remain untouched by this.

  3. The goal is to be partially backwards compatible. As I see it, new duplicati versions will be able to upgrade the local DB to work with the new migration system without data loss, but this is a one-way street: no going back to the old one, and no compatibility with older versions of duplicati (but I'm not sure if using multiple versions of duplicati to manage the exact same local database is something that is supported right now anyway; correct me if I'm wrong).
    Of course, that won't affect the scenario where an older version of duplicati recreates the DB in its format from remote files; the format of the remote storage is not intended to change.

Let's go over what these proposed tools are (and it's still debatable; I'm not insisting on these, just defaulting to them).

FluentMigrator

Almost every app (and, as far as I can tell, duplicati is no exception) needs some way to upgrade databases and keep track of their current versions. FluentMigrator is a tool that helps with that.
Yes, you are right, it is indeed an implementation of the schema migration you mentioned. It lets you abstract from a specific database engine (like SQLite) and describe your schema with classes and methods rather than SQL. It includes a migration runner as well, meaning that it can upgrade older versions of the DB to newer ones without requiring a developer to write a complete raw SQL script. You just specify what you want to do in a migration class, and FluentMigrator will figure out which SQL commands should be executed to bring the DB to the latest version.

For example, a migration that describes the current schema might look something like this:
https://pastebin.com/0HMu3ZF3 (I got a little lazy on the table descriptions and the Down method, but I think it should be enough for an example).

And a subsequent modification of the DB would look like this:
https://pastebin.com/QZaRi2uQ
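For readers who prefer not to follow external links, here is a minimal sketch of what such a modification migration could look like. This is illustrative only: the LogData table is simplified, and the RetryCount column is a hypothetical addition invented for this example, not part of duplicati's actual schema.

```csharp
using FluentMigrator;

// Hypothetical example: adds a column to a simplified LogData table.
[Migration(20201215)]
public class AddRetryCountToLogData : Migration
{
    public override void Up()
    {
        // FluentMigrator generates the engine-specific ALTER TABLE for us.
        Alter.Table("LogData")
            .AddColumn("RetryCount").AsInt32().NotNullable().WithDefaultValue(0);
    }

    public override void Down()
    {
        // The reverse operation, used when downgrading.
        Delete.Column("RetryCount").FromTable("LogData");
    }
}
```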

Pros:

  • Partial abstraction from SQLite and SQL.
  • Version management included; supports both upgrading and downgrading, if needed (provided, of course, that the developer specified how to downgrade a migration).
  • You don't need separate paths for updating an existing database and creating a new one from scratch, like (I suppose) there are right now. When you create a new DB, you just create an empty one and then execute all migrations, one by one.
  • (opinion) Easier to follow and maintain.
  • (opinion) Uses a well-known library with a well-documented API, thus lowering the time needed to start contributing to duplicati's DB code.

Cons:

  • Although it lets you abstract away from SQL, you still need a basic understanding of what you are doing. Indexes won't create themselves =)
  • Working with and maintaining this will require at least some knowledge of what FluentMigrator and migrations are.
  • A performance penalty when upgrading or creating DBs, compared to executing raw prepared scripts.
  • This approach has some limitations when it comes to using engine-specific features, like views or non-standard data types. This can be solved using raw SQL statements in migrations, which does require some SQL knowledge and might be a problem if there is ever a need to switch from SQLite to something else (but, arguably, still fewer problems than there would be right now).
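As a sketch of that escape hatch: FluentMigrator's Execute.Sql lets a migration run raw SQL when the fluent API can't express something. The view below is purely hypothetical, invented for illustration.

```csharp
using FluentMigrator;

// Hypothetical example of falling back to raw SQL inside a migration,
// e.g. for features the fluent API does not cover (such as views).
[Migration(20201216)]
public class AddLogSummaryView : Migration
{
    public override void Up()
    {
        Execute.Sql("CREATE VIEW LogSummary AS SELECT Timestamp, Message FROM LogData");
    }

    public override void Down()
    {
        Execute.Sql("DROP VIEW LogSummary");
    }
}
```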

Linq2db

Managing the schema is one problem. The second one is accessing said schema.
Right now, duplicati uses raw SQL queries and low-level SqlConnection classes to access and modify data. This is a simple solution from an architecture and code standpoint, but it results in long string literals with SQL requests all over the code, like here.

Linq2db is an ORM that lets you rewrite SQL queries as strongly typed LINQ queries.
For example, querying the last 50 entries of log data might look something like this:

```csharp
using (var dbContext = new BackupDb())
{
    var logEntries = dbContext.LogData
        .OrderByDescending(l => l.Timestamp)
        .ThenByDescending(l => l.ID)
        .Take(50)
        .ToList();
}
```

The LINQ expression is translated into an SQL script, which gets sent to the DB engine (in duplicati's case, SQLite) for processing; the result is then mapped onto a DTO object for ease of use and compile-time type checking.

Pros:

  • More abstraction from SQL: you just write a LINQ query, which gets translated into SQL by the library.
  • A strongly typed way to access DB data.
  • (opinion) Easier transaction management. SQLite supports transactions, and so does Linq2db.
  • (opinion) Easier optimizations. You don't have to query the whole DTO; you can always query only specific columns and still have a strongly typed anonymous object to work with. SQLite won't even read the omitted columns from disk, if that is possible for the query.
  • (opinion) Again, it is a well-known library with a well-documented API, which could help newcomers get started. Not to mention that many people already use Linq2db, and they might feel right at home.
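To illustrate the column-projection point, a hedged sketch, assuming the same hypothetical BackupDb context and LogData mapping as the query example above:

```csharp
using System.Linq;

// Assumes the hypothetical BackupDb context from the example above.
using (var dbContext = new BackupDb())
{
    // Only Timestamp and Message end up in the generated SELECT;
    // the result is still a strongly typed (anonymous) object.
    var recent = dbContext.LogData
        .OrderByDescending(l => l.Timestamp)
        .Select(l => new { l.Timestamp, l.Message })
        .Take(50)
        .ToList();
}
```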

Cons:

  • Again, even if you don't need to write SQL directly, it's still easier if you have at least a basic understanding of SQL queries.
  • You'll have to be comfortable with LINQ, especially if JOINs are involved.
  • Some performance penalty from DTO mapping, although it's not that bad compared to other ORMs like EntityFramework.
  • If you've never used an ORM before, it will be a new thing to learn and get used to, which might be a problem for someone already working on duplicati.
  • Using engine-specific features does require raw SQL queries (which is supported), just like with FluentMigrator. The gotchas here are the same as with FluentMigrator: it could cause some trouble when changing SQLite to something else.

TL;DR: Integrating these tools won't be easy and will require significant changes to the DB code, but it will ease future modifications.
It will require some knowledge of these tools though, which might be a problem for the core team.

Now, about your specific concerns.

These tools (to the extent I could follow them) seemed like they reduced "low level" code but needed high level training.

You are right that it requires some additional skills, but the bar is not as high as you might think. Both of these tools have great (IMO) documentation and plenty of info out there to read and use. Yes, the community is not as big as SQL's, but these tools are pretty widespread in the .NET community, and the concepts of ORMs and migrations are similar and widely used in many other languages and ecosystems as well. I would consider using these tools valuable experience.

I'm not sure about the resources or learning curve for higher tools.

Linq2db builds on LINQ, which is native to .NET and has plenty of resources and guides on how to use it. Learning is of course necessary, but the barrier to entry is IMO not that high for an experienced .NET developer. For someone starting out in .NET it might be a bit much, but ORMs and migrations exist in other languages too, so they can reuse some of their existing knowledge; and if there is nothing to reuse, it's a pretty valuable thing to learn and have at least a basic understanding of.

From a maintainability point of view, would migrated code be easier to read and follow by those who aren't DB-expert?

I would say yes.

Duplicati does have some SQL code that should go faster. Maybe the DB design is off, maybe the queries are off. Any help there?

I would say yes: having an easier way to make DB queries and modify the DB schema could help with optimization. It is easier to optimize something when it is easier to change its logic.
If what you are asking is "How about just rewriting the existing SQL queries?", then yes, that would be easier in the short term, but it probably won't do any good in the long run. And adding new features might complicate things even more, bringing us back to where we started, just with more code to consider.

There might also be some actual bugs in logic (either C# or the SQL it runs). Tracking them down in field is not possible. Tracking them down in reproducible lab conditions can involve lots of digging through logs of SQL use. Any help there?

There are some tools to log executed SQL queries of course.

There are currently some cases where crashes (or kills) leave DB damaged. Rollback design is not understood. Any help?

Yes, transactions are supported by said tools, and they are pretty easy to use. Their semantics don't really differ from what you find in SQL (precisely because they use SQL transactions under the hood), so learning them won't be a big challenge.
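As a hedged sketch of what that could look like with Linq2db (again assuming the hypothetical BackupDb context from earlier; Linq2db's DataConnection exposes BeginTransaction/Commit/Rollback, which wrap ordinary SQL transactions):

```csharp
// Assumes the hypothetical BackupDb : DataConnection context used earlier.
using (var db = new BackupDb())
using (var tx = db.BeginTransaction())
{
    try
    {
        // ... several inserts/updates that must succeed or fail together ...
        tx.Commit();
    }
    catch
    {
        // On error, roll the DB back to a consistent state.
        tx.Rollback();
        throw;
    }
}
```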

I'm willing to be that DB volunteer for this transition (although I have some understanding of DBs and the tools above, I'm not an expert by any stretch), and I will be happy to help duplicati more in the future. At the same time, I understand that such a big change will require lots of testing and might not sit well with everyone. After all, if the core team isn't able or doesn't want to deal with an ORM and FluentMigrator, then it might be better to leave things as they are.

@SomebodyOdd SomebodyOdd reopened this Dec 15, 2020
@ts678
Collaborator

ts678 commented Dec 17, 2020

Thanks for the extensive note. I'd probably have to get at least several key people interested for this to happen. I'll go ask...
Right now the main goal seems to be getting out of Beta, and this transition, while maybe good long-term, might not help.

Possibly after next Beta goes out (it's about time right now) would be a chance to tinker and maybe break things in Canary.
The other big change in the works is .NET 5. From a GitHub branch view, Duplicati's not well set up for parallel big changes.
Releases come from the master branch, and if it gets broken or loses functionality for a while during development, it hurts.

There are some tools to log executed SQL queries of course.

Duplicati does this in a profiling log now. As you might imagine, the SQL queries that are seen are long and tough to read.
SQL from an ORM sounds like it could be too (or maybe it can do better). Can one trace slow SQL back to its source code?
In a small bit of web searching, I saw one discussion where participants had opposite views on the readability of ORM SQL.

I'm not sure if this is a fair comparison, but this reminds me of some round-trip challenges that code generators may pose.
What looks very nice at a high level becomes a challenge to debug at a low level when things go wrong. Murphy's Law....
Object–relational impedance mismatch from your ORM article link talks about this some, but this is far out of my expertise.

Thanks again.

@warwickmm
Member

Thanks @SomebodyOdd and @ts678 for the detailed thoughts. I haven't had time to read the above conversation carefully, but let me leave some brief comments here:

  1. I agree that the queries could definitely use some work, in terms of correctness, readability, and performance.
  2. Using a LINQ-like syntax for abstraction would have some benefits. However, I do often find actual SQL queries easier to understand. I sometimes find myself having to mentally translate LINQ queries back to a "standard" representation. In other words, LINQ is sometimes too high-level.
  3. In terms of readability, I think simply reformatting the queries into a more consistent and readable format (as opposed to the query being defined on a single line) would be a huge benefit.
  4. If one does embark on such a task, I would strongly recommend that we:
    • Convert one query at a time, with a separate pull request for each into a long-lived branch. The long-lived branch can present maintenance issues, but since we don't modify the query code that often, perhaps it won't be too bad.
    • Have sufficient tests for correctness and performance. We are slowly getting better here.
    • Have enough developer time and skill for adequate review. This is where we are really struggling. Unfortunately, most of our active developers have only recently joined the project, and have limited knowledge about the origins and reasons behind much of the code. I can imagine many queries where the translation is simple and clear. Others can be much more complicated.

@kenkendk
Member

Thanks @SomebodyOdd for bringing it up! And thanks @ts678 & @warwickmm for the follow up.

When I originally wrote the queries, the only substitutes I could find were NHibernate and EntityFramework-like solutions, which I deemed too convoluted. As often happens, the low-level stuff got pretty convoluted itself.

I have worked a bit with Dapper, which seems to be a middle ground: you still have strings, but can avoid embedded column/table strings. It takes a lot of the guessing out, but does not add any compatibility magic.

I recently wrote something similar to Linq2db, which I think is a nice way to work with databases. It is important that we can "go back to text" if we need to run one of the queries that are not supported.

I am in favor of converting some of the more complex queries to Linq2db and seeing if it works as expected. If it works well, we can consider eventually converting all the queries and measure whether there are any performance impacts.

@ts678
Collaborator

ts678 commented Jan 30, 2024

As part of the Issues clean up effort to make bugs fewer and more visible, I'm giving this an enhancement label to move it aside.
