
Recreate SQL Performance #2050

Closed
1 task done
jarfil opened this issue Oct 24, 2016 · 16 comments

Comments

@jarfil

jarfil commented Oct 24, 2016

I have:

  • searched open and closed issues for duplicates

Related issues: #1715 #1582 #1391 #1881 #1928


Version info

Duplicati Version: 2.0.1.28_canary_2016-10-19
Operating System: Windows 10 Pro x64
Backend: ACD

Bug description

Database recreation runs some extremely slow SQL queries that take over 10 minutes each.

Steps to reproduce

  • Create backup of 500k files and 100GB, with 50MB volume size
  • Delete database
  • Attempt to rebuild database

Actual result:

  • Running for 24+ hours and the database rebuild has still not finished.
  • Keeps downloading actual data block files instead of just the indices.
  • A ~10-minute SQL query for every ~15-second data block download.

Expected result:

  • Download just the indices
  • Finish ASAP
  • Worst case scenario, have an additional option for recreating the indices

Debug log

INSERT INTO "BlocksetEntry" ("BlocksetID", "Index", "BlockID") SELECT DISTINCT "BlocksetID", "Index", "BlockID" FROM (SELECT "E"."BlocksetID" AS "BlocksetID", "D"."FullIndex" AS "Index", "F"."ID" AS "BlockID" FROM ( 
 SELECT DISTINCT
 "E"."BlocksetID",
 "F"."Index" + ("E"."BlocklistIndex" * 3200) AS "FullIndex",
 "F"."BlockHash",
 MIN(102400, "E"."Length" - (("F"."Index" + ("E"."BlocklistIndex" * 3200)) * 102400)) AS "BlockSize",
 "E"."Hash",
 "E"."BlocklistSize",
 "E"."BlocklistHash"
 FROM
 (
 SELECT * FROM
 (
 SELECT 
 "A"."BlocksetID",
 "A"."Index" AS "BlocklistIndex",
 MIN(3200 * 32, ((("B"."Length" + 102400 - 1) / 102400) - ("A"."Index" * (3200))) * 32) AS "BlocklistSize",
 "A"."Hash" AS "BlocklistHash",
 "B"."Length"
 FROM 
 "BlocklistHash" A,
 "Blockset" B
 WHERE 
 "B"."ID" = "A"."BlocksetID"
 ) C,
 "Block" D
 WHERE
 "C"."BlocklistHash" = "D"."Hash"
 AND
 "C"."BlocklistSize" = "D"."Size"
 ) E,
 "TempBlocklist-9D597DA14163454FA26B16C55ED54E02" F
 WHERE
 "F"."BlocklistHash" = "E"."Hash"
 ORDER BY 
 "E"."BlocksetID",
 "FullIndex"
 ) D, "BlocklistHash" E, "Block" F, "Block" G WHERE "D"."BlocklistHash" = "E"."Hash" AND "D"."BlocklistSize" = "G"."Size" AND "D"."BlocklistHash" = "G"."Hash" AND "D"."Blockhash" = "F"."Hash" AND "D"."BlockSize" = "F"."Size" UNION SELECT "Blockset"."ID" AS "BlocksetID", 0 AS "Index", "Block"."ID" AS "BlockID" FROM "Blockset", "Block", "TempSmalllist-F1F3377EE0C9414592F4CE7C69865B81" S WHERE "Blockset"."Fullhash" = "S"."FileHash" AND "S"."BlockHash" = "Block"."Hash" AND "S"."BlockSize" = "Block"."Size" AND "Blockset"."Length" = "S"."BlockSize" AND "Blockset"."Length" <= 102400 ) A WHERE ("A"."BlocksetID" || ':' || "A"."Index") NOT IN (SELECT ("BlocksetID" || ':' || "Index") FROM "BlocksetEntry" )
took 00:10:42.373
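
For context, the expensive part of the query above is the trailing anti-join, which compares concatenated strings ("BlocksetID" || ':' || "Index") and therefore cannot use any index on "BlocksetEntry". As a minimal sketch of the general technique (not the fix that was eventually committed), the same filter can be expressed with NOT EXISTS so SQLite can probe a ("BlocksetID", "Index") key directly, assuming such an index or primary key exists; "NewEntries" below is a hypothetical stand-in for the long UNION sub-select above:

INSERT INTO "BlocksetEntry" ("BlocksetID", "Index", "BlockID")
SELECT DISTINCT "A"."BlocksetID", "A"."Index", "A"."BlockID"
FROM "NewEntries" A  -- hypothetical staging table standing in for the UNION sub-select above
WHERE NOT EXISTS (
  -- pair-wise comparison instead of concatenated strings
  SELECT 1 FROM "BlocksetEntry" B
  WHERE "B"."BlocksetID" = "A"."BlocksetID"
    AND "B"."Index" = "A"."Index"
);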

Suspects

  • Optimize the SQL queries
    • several-hundred-MB sized temporary tables are being created during query execution
    • Duplicati.Library.Main.Database.LocalRecreateDatabase:202 FindMissingBlocklistHashes
  • Could the missing blocklist search be made optional?
    • Duplicati.Library.Main.Operation.RecreateDatabaseHandler:352 restoredb.FindMissingBlocklistHashes
  • SQLite cache options (a few standard PRAGMA knobs are sketched after this list)
  • Replace SQLite with external SQL database like MySQL, PostgreSQL, etc. as an option.
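
On the "SQLite cache options" point above, these are the standard SQLite pragmas one could experiment with on the connection used during recreate; the values are illustrative, not Duplicati defaults, and none of them are confirmed to fix this issue:

PRAGMA cache_size = -200000;   -- ~200 MB page cache (a negative value is a size in KiB)
PRAGMA temp_store = MEMORY;    -- keep temporary tables and indices in RAM
PRAGMA journal_mode = WAL;     -- write-ahead log, fewer syncs per transaction
PRAGMA synchronous = NORMAL;   -- trade a little durability for speed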

@agrajaghh
Contributor

I guess there are two issues here:

  • slow SQL Operations which should be optimized
  • you ran into a bug which made it necessary to download the dblock files

Did the backup job have any errors before you deleted the database? Do you still have the log?

@saviodsouza

slow SQL Operations which should be optimized

SQLite is very slow, especially in the case of large backups. When the SQLite database grows to 600 MB and beyond, rebuilding the database goes on and on.

SQLite cache options
Replace SQLite with external SQL database like MySQL, PostgreSQL, etc. as an option.

Replacing SQLite with an alternative centralized database should be taken into consideration, especially for enterprise backup deployments.
In cases where administrators need to restore data backed up from servers and endpoint clients within a critical time window, rebuilding the database today takes hours or days depending on various factors.

With this, while configuring backups you would choose between SQLite, MySQL, etc.
An admin console could be built to control and monitor logs, etc.

@agrajaghh
Contributor

agrajaghh commented Oct 25, 2016

Did the backup job have any errors before you deleted the database? Do you still have the log?

This might be helpful to identify why the dblock files were downloaded and not just the dlist and dindex files...

@mnaiman
Contributor

mnaiman commented Oct 26, 2016

I think that if the queries turn out to be hard to optimize, it would be better to process the data as C# objects with LINQ. That would be thousands of percent faster, because SQLite is slow as hell with bigger DBs, but it is good storage for the result.

Centralized DBs are faster, but then it would no longer be a compact, portable solution.

@kenkendk
Member

@jarfil :
The call to FindMissingBlocklistHashes appears to append to the Block and BlocksetEntry tables, and does not appear to create any temporary tables. Perhaps what you see is the transaction files that SQLite creates when performing the operations. The transaction files are created to keep the database consistent if the system crashes during the operation, and are then merged into the main database.
We can reduce these transactions by collecting more than a single block before updating the database, and I will try that and see what happens.

@saviodsouza : Do you have any references that show SQLite is slow when the DB is larger than 600 MB?
From what I discussed with @FootStark, SQLite is generally as fast as the enterprise DBs, but lacks many features, such as auto-optimization etc.
Could the problem be that you are using a traditional "spinning disk" and not an SSD?
I could imagine some performance issues if that is the case.

Everything inside Duplicati uses a standard IDbConnection that could point to any database, but we do not currently have anything that allows you to set a custom connection string; that should not be too hard to change.
I think allowing this would make sense to solve the enterprise scenario you mention, but I would not expect it to become much faster.

@mnaiman : It will not be possible to do this with LINQ, as the databases are too large to fit in memory (which is why there are so many hoops to jump through to query directly on the data).
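
As a rough sketch of the batching idea mentioned above (collecting more than a single block before updating the database), the standard SQLite pattern is to wrap the whole batch in one explicit transaction so only one journal sync is paid per batch; the "Hash" and "Size" columns are taken from the query in the report, the values are made up, and the real "Block" table may have additional columns:

BEGIN;                                                -- one transaction for the whole batch
INSERT INTO "Block" ("Hash", "Size") VALUES ('hash-1', 102400);
INSERT INTO "Block" ("Hash", "Size") VALUES ('hash-2', 102400);
-- ...collect a few hundred blocks in memory, insert them all, then...
COMMIT;                                               -- instead of one implicit transaction per INSERT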

kenkendk added a commit that referenced this issue Oct 27, 2016
…we need all the volumes anyway.

This is an experiment related to #2050.
@saviodsouza

Hi,

Yes, you are right, I have a traditional 'spinning disk' (320 GB).
This is the reason that every time I attempt a file restore test, the arrow on the restore files page turns into a pie and goes on and on.

Correct me if I am wrong:
Logs are also saved in the same SQLite file. If we stored the logs in a separate DB, the original DB with file details, hashes, etc. would be smaller. All I want to say about the SQLite DBs is: can we reduce the file size by storing logs, temp data, etc. in a separate DB?

@gerco
Contributor

gerco commented Nov 3, 2016

I also experienced slowness rebuilding the SQLite database. In my case the database is about 1.4 GB and my backup holds ~800k files totaling ~200 GB. I did a rebuild which didn't finish overnight. I'm running on a MacBook Pro with an SSD and 16 GB of RAM.

I haven't done any research as to why it's slow but I'll look into it and see if I can come up with something more concrete.

@dsl

dsl commented Nov 3, 2016

I've been running a rebuild since October 26th, and it is showing no signs of finishing any time soon. :)
That's over 7 days... :)

@saviodsouza

Working with the command line:

Downloading file (49.98 MB)....
Downloading file (49.97 MB)....
etc

It's endless and each file is extremely slow.
Wouldn't it be nice to show it as:

Downloading file 1 of 1044 (49.98 MB)...
Downloading file 2 of 1044 (49.97 MB)...

That way it would show some count of progress.

What are the possible reasons to re-download all the blocks to build the database?

Secondly, it should support resuming database creation.
I get something like 'in progress .... mark incomplete'
and then have to delete the old database and start all over again.

Regards,
Savio

@sibbi77

sibbi77 commented Dec 21, 2016

I have a similar problem, but it is the SQL verify step, which takes ages:

sqlite> select * from LogData;
1|1|1482171632|Message|Rebuild database started, downloading 35 filelists|
2|1|1482181833|Message|Filelists restored, downloading 2150 index files|
3|1|1482183848|Message|Recreate completed, verifying the database consistency|

from log:
Rebuild started 12/19/2016 @ 6:20pm (UTC)
Recreate completed 12/19/2016 @ 9:44pm (UTC)

The profiling log from the web interface shows:

[...]
21. Dez. 2016 10:33: ExecuteScalar: SELECT COUNT(*) FROM (SELECT DISTINCT "Path" [...]
21. Dez. 2016 06:42: Starting - ExecuteScalar: SELECT COUNT(*) FROM (SELECT DISTINCT "Path" FROM ( SELECT "L"."Path", "L"."Lastmodified", "L"."Filelength", "L"."Filehash", "L"."Metahash", "L"."Metalength", "L"."BlocklistHash", "L"."FirstBlockHash", "L"."FirstBlockSize", "L"."FirstMetaBlockHash", "L"."FirstMetaBlockSize", "M"."Hash" AS "MetaBlocklistHash" FROM ( SELECT "J"."Path", "J"."Lastmodified", "J"."Filelength", "J"."Filehash", "J"."Metahash", "J"."Metalength", "K"."Hash" AS "BlocklistHash", "J"."FirstBlockHash", "J"."FirstBlockSize", "J"."FirstMetaBlockHash", "J"."FirstMetaBlockSize", "J"."MetablocksetID" FROM ( SELECT "A"."Path" AS "Path", "D"."Lastmodified" AS "Lastmodified", "B"."Length" AS "Filelength", "B"."FullHash" AS "Filehash", "E"."FullHash" AS "Metahash", "E"."Length" AS "Metalength", "A"."BlocksetID" AS "BlocksetID", "F"."Hash" AS "FirstBlockHash", "F"."Size" AS "FirstBlockSize", "H"."Hash" AS "FirstMetaBlockHash", "H"."Size" AS "FirstMetaBlockSize", "C"."BlocksetID" AS "MetablocksetID" FROM "File" A, "Blockset" B, "Metadataset" C, "FilesetEntry" D, "Blockset" E, "Block" F, "BlocksetEntry" G, "Block" H, "BlocksetEntry" I WHERE "A"."ID" = "D"."FileID" AND "D"."FilesetID" = ? AND "A"."BlocksetID" = "B"."ID" AND "A"."MetadataID" = "C"."ID" AND "E"."ID" = "C"."BlocksetID" AND "B"."ID" = "G"."BlocksetID" AND "G"."BlockID" = "F"."ID" AND "G"."Index" = 0 AND "I"."BlocksetID" = "E"."ID" AND "I"."BlockID" = "H"."ID" AND "I"."Index" = 0 ) J LEFT OUTER JOIN "BlocklistHash" K ON "K"."BlocksetID" = "J"."BlocksetID" ORDER BY "J"."Path", "K"."Index" ) L LEFT OUTER JOIN "BlocklistHash" M ON "M"."BlocksetID" = "L"."MetablocksetID" ) UNION SELECT DISTINCT "Path" FROM ( SELECT "G"."BlocksetID", "G"."ID", "G"."Path", "G"."Length", "G"."FullHash", "G"."Lastmodified", "G"."FirstMetaBlockHash", "H"."Hash" AS "MetablocklistHash" FROM ( SELECT "B"."BlocksetID", "B"."ID", "B"."Path", "D"."Length", "D"."FullHash", "A"."Lastmodified", "F"."Hash" AS "FirstMetaBlockHash", "C"."BlocksetID" AS "MetaBlocksetID" FROM "FilesetEntry" A, "File" B, "Metadataset" C, "Blockset" D, "BlocksetEntry" E, "Block" F WHERE "A"."FileID" = "B"."ID" AND "B"."MetadataID" = "C"."ID" AND "C"."BlocksetID" = "D"."ID" AND "E"."BlocksetID" = "C"."BlocksetID" AND "E"."BlockID" = "F"."ID" AND "E"."Index" = 0 AND ("B"."BlocksetID" = ? OR "B"."BlocksetID" = ?) AND "A"."FilesetID" = ? ) G LEFT OUTER JOIN "BlocklistHash" H ON "H"."BlocksetID" = "G"."MetaBlocksetID" ORDER BY "G"."Path", "H"."Index" ))
[...]

Some of these SELECT statements take approx. 4 hours; some finish instantly.
The process has not finished yet...
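
For anyone digging into which of these statements is slow, SQLite's EXPLAIN QUERY PLAN can be run against a copy of the local database to see whether a statement ends up doing full table scans instead of index lookups; the short query below is only a stand-in for the long statement quoted above:

-- prefix the suspect statement with EXPLAIN QUERY PLAN (shortened stand-in shown)
EXPLAIN QUERY PLAN
SELECT COUNT(*) FROM (SELECT DISTINCT "Path" FROM "File");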

@sftwr-ngnr

sftwr-ngnr commented Dec 21, 2016

Same issue for me while recreating the DB with v2.0.1.34 on Windows and FTP storage (box.com, backup: 47.51 GB / 58 versions / 1380 files): it didn't complete after 24h... Then I aborted and initialized a new full backup.
While recreating I could see constant, very high CPU usage at 75% (maybe that number is related to the "lzma-thread-count=3" I configured). There were, over time, peaks in network traffic (1 Gbit line), but I cannot say what it fetched.
Via the box.com website I queried some access stats on the files. It seems it only fetched *.dlist.7z.aes files. At least in my samples I couldn't find any access to the *.dblock.7z.aes & *.dindex.7z.aes files.

kenkendk added a commit that referenced this issue Jan 18, 2017
This is likely a fix to these issues:
#1699, #2048, #2140, #2178

The speedup is also drastic in the core query. A test database had the time reduced from 2.5 hours to 1 minute.
This could also help with issues:
#1391, #2050
kenkendk added a commit that referenced this issue Jan 19, 2017
This is a less experimental rewrite of the query that fixes the exact problem.
The speedups from this query fix are also substantial, going from 2.5 hours to 11 minutes.

This is likely a fix to these issues:
#1699, #2048, #2140, #2178

This could help with issues:
#1391, #2050
@kenkendk
Member

kenkendk commented Feb 2, 2017

I am collecting performance-related issues under #2302

@kenkendk kenkendk closed this as completed Feb 2, 2017
@rkarlsba

rkarlsba commented Oct 2, 2017

I'm seeing something similar: a single core at 100% CPU, running for ages and not doing much. Since everything seems to be single-threaded, it effectively blocks the entire Duplicati application. If it's as simple as stated above to add a custom connection string, I believe this should be done ASAP to allow those of us who want to use a real RDBMS to do so. SQLite3 is IMHO good for small stuff, but it really sucks at locking, which seems to have struck this time. PostgreSQL does not have such an issue and scales vastly better. MySQL/MariaDB/whatever-fork is also far better, although I prefer PostgreSQL over them.

@mdsmdsmds

@kenkendk When might it be possible to implement the external DB connection string?

Everything inside Duplicati uses a standard IDbConnection that could point to any database, but we do not currently have anything that allows you to set a custom connection string; that should not be too hard to change.
I think allowing this would make sense to solve the enterprise scenario you mention, but I would not expect it to become much faster.

@rkarlsba

rkarlsba commented Apr 7, 2018

Allowing external databases isn't just about speed, it's also about integrity. SQLite is fine, but using something like PostgreSQL is a lot safer, and faster, and allowing those of us who know a thing or two to customise this would be very nice indeed.

@mdsmdsmds

@rkarlsba
You’re right, I totally agree.
