
Deserialization performance #2087

Merged

Conversation

Contributor

@crsib crsib commented Nov 3, 2021

A first attempt to improve the loading times for large projects. This improves both performance and memory usage by around 2x.

A 46 MB binary produces a 75 MB XML stream, which requires around 90 MB of RAM and 5 seconds to load. Previously, the RAM overhead was ~380 MB and the loading time was 11 seconds.

Peak memory usage is now 221 MB, down from ~580 MB. Audacity uses 134 MB right after the project is loaded.

XML processing is now responsible for 78% of that time:
(profiler screenshot)

Half of the time is spent in "StartElement"
(profiler screenshot)

26% of the time is spent on constructing wxString objects.
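Much of that wxString cost comes from materializing an owning string for every tag and attribute. A minimal sketch of the alternative this PR moves toward, with hypothetical tag names and a hypothetical consumer (OnStartElement and ParseOffset are illustrative, not the actual Audacity callbacks):

```cpp
#include <string_view>
#include <utility>
#include <vector>

using Attribute = std::pair<std::string_view, std::string_view>;

// Hypothetical consumer; a real one would parse the number in place.
void ParseOffset(std::string_view value) { (void)value; }

// SAX-style handler: names and values stay as views into the parser's
// UTF-8 buffer, so no wxString (and no heap allocation) is constructed
// for elements that are merely inspected.
void OnStartElement(std::string_view tag, const std::vector<Attribute>& attrs)
{
   if (tag != "waveclip") // hypothetical tag name
      return;
   for (const auto& [name, value] : attrs)
      if (name == "offset")
         ParseOffset(value);
}
```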

  • I signed CLA
  • The title of the pull request describes an issue it addresses
  • If changes are extensive, then there is a sequence of easily reviewable commits
  • Each commit's message describes its purpose and effects
  • There are no behavior changes unnecessary for the stated purpose of the PR

Recommended:

  • Each commit compiles and runs on my machine without known undesirable changes of behavior

/*!
* @brief A low overhead memory stream with O(1) append, low heap fragmentation and a linear memory view.
*
* wxMemoryBuffer always appends 1Kb to the end of the buffer, causing severe performance issues
* and significant heap fragmentation. There is no possibility to control the increment value.
*
* std::vector doubles it's memory size which can be problematic for large projects as well.
Collaborator

its, not it's

using StreamChunk = std::pair<const void*, size_t>;

private:
static constexpr size_t ChunkSize = 1024 * 1024;
Collaborator

I wonder if there would be less waste of operating system pages if you reduced this constant by 2 * sizeof(void*)? That is, sufficient space for the overhead of a std::list node in the probable implementation. I'm guessing the operating system is otherwise allocating a few big pages only to use 16 bytes of the last one for the end of the chunk.

Contributor Author

Yeah, makes total sense, thank you!

Collaborator

Actually it should be more than two pointers, for operator new overhead. Just how much would of course vary with platform and debug configuration.
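For concreteness, a minimal sketch of the scheme under discussion (illustrative names, not the PR's actual MemoryStream class; the operator-new overhead term is a guess, as noted above):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <list>

class ChunkedStream
{
   static constexpr size_t NewOverhead = 2 * sizeof(void*); // a guess
   static constexpr size_t ChunkSize =
      1024 * 1024            // target ~1 MB per node
      - 2 * sizeof(void*)    // std::list prev/next pointers
      - sizeof(size_t)       // per-chunk "bytes used" member
      - NewOverhead;         // allocator header, platform dependent

   struct Chunk
   {
      size_t Used = 0;
      char Data[ChunkSize];
   };

   std::list<Chunk> mChunks;

public:
   void AppendData(const void* data, size_t size)
   {
      auto ptr = static_cast<const char*>(data);
      while (size > 0)
      {
         if (mChunks.empty() || mChunks.back().Used == ChunkSize)
            mChunks.emplace_back(); // O(1) append, no reallocation
         auto& chunk = mChunks.back();
         const size_t count = std::min(size, ChunkSize - chunk.Used);
         std::memcpy(chunk.Data + chunk.Used, ptr, count);
         chunk.Used += count;
         ptr += count;
         size -= count;
      }
   }
};
```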

@@ -30,7 +30,10 @@ the general functionality for creating XML in UTF8 encoding.
 #include <wx/ffile.h>
 #include <wx/intl.h>
 
-#include <string.h>
+#include <cstring>
+#include <charconv>
Collaborator

As you now know, <charconv> is incomplete on macOS and not even present in the Linux build. Sorry, you must find an alternative to to_chars.

Contributor Author

I'm painfully aware of it now. It is really sad that such a simple part of C++17 is still not fully implemented in 2021.
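One possible shape of the workaround, sketched here (assuming the feature-test-macro route; not necessarily what the PR ended up doing):

```cpp
#include <cstdio>
#include <limits>
#include <string>
#if defined(__has_include)
#  if __has_include(<charconv>)
#    include <charconv>
#  endif
#endif

std::string DoubleToString(double value)
{
   char buffer[40]; // sign + 17 digits + dot + "e-308" + NUL fits easily
#if defined(__cpp_lib_to_chars)
   // Fast path where <charconv> actually implements floating point.
   auto [end, ec] = std::to_chars(buffer, buffer + sizeof(buffer), value);
   if (ec == std::errc())
      return std::string(buffer, end);
#endif
   // Portable fallback: max_digits10 guarantees the value round-trips.
   std::snprintf(buffer, sizeof(buffer), "%.*g",
                 std::numeric_limits<double>::max_digits10, value);
   return buffer;
}
```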

@crsib crsib force-pushed the 2051_deserialization_performance branch 2 times, most recently from 41e23c9 to 9d2a064 Compare November 8, 2021 15:04
@@ -394,45 +395,45 @@ wxString ProjectSerializer::Decode(const wxMemoryBuffer &buffer)
 mIds.clear();
 
 struct Error{}; // exception type for short-range try/catch
-auto Lookup = [&mIds]( UShort id ) -> const wxString &
+auto Lookup = [&mIds]( UShort id ) -> std::string_view
Collaborator

Look below to the FT_Push and FT_Pop cases. I think there is an opportunity to use move assignment or swaps for a bit more performance, instead of copying mIds.

Collaborator

FT_Push and FT_Pop never happen, in fact!
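The reviewer's suggestion, sketched with hypothetical types (and moot here, since FT_Push/FT_Pop never occur in practice): move the id dictionary instead of copying it when a scope is pushed or popped.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct Decoder
{
   using IdMap = std::unordered_map<unsigned short, std::string>;
   IdMap mIds;
   std::vector<IdMap> mIdStack;

   void OnPush()
   {
      mIdStack.push_back(std::move(mIds)); // was a copy: push_back(mIds)
      mIds.clear(); // moved-from map is valid but unspecified; reset it
   }

   void OnPop()
   {
      mIds = std::move(mIdStack.back());
      mIdStack.pop_back();
   }
};
```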

static constexpr size_t ChunkSize =
1024 * 1024 - // 1Mb
2 * sizeof(void*) - // account for the list node pointers
sizeof(size_t); // account for the bytes used member
Collaborator

Maybe subtract a little more for operator new overhead. Make a guess what that is.

Contributor Author

I saw no measurable difference here, to be honest. If the CRT decides to allocate a page-size-aligned amount of memory, we will have an overhead of 4K (less than 1% in this case) and will likely be able to reuse it. If the CRT falls back to some mmap-like API, there is no overhead that we can control here. Controlling the overhead from std::allocator is more feasible, but I'm not sure we really need such control. (And the simplest way is to simply use a handwritten allocator.)

Anyway, the current bottleneck is very far away from this place, both for serialization and deserialization. A better fix would be to perform incremental I/O directly from the database, but the SQLite3 interface doesn't really have an incremental write API. We can avoid memory "linearization", though, because sqlite3_bind_zeroblob is a very fast API (it does not actually commit any pages, rather just sets a proper header).

Collaborator

Shrug. Whatever easy little help we can get. Yes, measure first.

@crsib crsib marked this pull request as ready for review November 9, 2021 09:24
@crsib crsib force-pushed the 2051_deserialization_performance branch from 9d2a064 to cb670da Compare November 9, 2021 09:34
@Paul-Licameli
Collaborator

I have reviewed everything but the XMLUtf8BufferWriter class and have no objections to those parts.

I assume the ToChars implementation is just a great cut and paste of trusted code. Or are there any differences from your source I should know about?

@crsib
Contributor Author

crsib commented Nov 9, 2021

I assume the ToChars implementation is just a great cut and paste of trusted code. Or are there any differences from your source I should know about?

The only major difference is that I have added buffer size checks to Grisu2, which always blindly assumed that "the buffer is big enough".
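The sort of check being described, sketched in isolation (the real change lives inside the imported Grisu2 code; names here are illustrative):

```cpp
#include <cstddef>

// Append one character only if it fits; the original Grisu2 code wrote
// unconditionally, assuming "the buffer is big enough".
bool AppendChar(char* buffer, size_t capacity, size_t& length, char c)
{
   if (length >= capacity)
      return false; // would overflow: report failure instead of writing
   buffer[length++] = c;
   return true;
}
```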

@Paul-Licameli
Collaborator

You don't yet use sqlite3_bind_zeroblob. Do you mean to?

This would allow skipping MemoryStream::GetData(), which is one huge allocation and memory movement?

Why is the blob handle interface "not really" the incremental write we need?

Are there other opportunities for performance improvement you have seen but not yet implemented in this PR?

{
constexpr size_t bufferSize = std::numeric_limits<float>::max_digits10 +
5 + // No constexpr log2 yet! example - e-308
3; // Dot, sing an 0 separator
Collaborator

Sign

Contributor Author

Why is the blob handle interface "not really" the incremental write we need?

SQLite needs to know the blob size in advance, so there is no way to stream data into the blob as you can into a file, for example. However, it is possible to reuse the "iterators" interface and drop the GetData call.
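A sketch of that follow-up idea, for the record (hypothetical table and column names; not code from this PR): the total size is computable from the chunk list, so the blob can be reserved with sqlite3_bind_zeroblob and then filled chunk by chunk through the incremental blob API, with no linearized copy.

```cpp
#include <sqlite3.h>
#include <list>
#include <utility>

using StreamChunk = std::pair<const void*, size_t>;

bool WriteChunks(sqlite3* db, sqlite3_int64 rowId,
                 const std::list<StreamChunk>& chunks)
{
   sqlite3_int64 total = 0;
   for (const auto& [data, size] : chunks)
      total += size;

   // Reserve the blob: zeroblob just records the size, which is cheap.
   sqlite3_stmt* stmt = nullptr;
   if (sqlite3_prepare_v2(db, "UPDATE project SET doc = ?1 WHERE id = ?2;",
                          -1, &stmt, nullptr) != SQLITE_OK)
      return false;
   sqlite3_bind_zeroblob64(stmt, 1, total);
   sqlite3_bind_int64(stmt, 2, rowId);
   const bool reserved = sqlite3_step(stmt) == SQLITE_DONE;
   sqlite3_finalize(stmt);
   if (!reserved)
      return false;

   // Stream each chunk into the reserved blob at its running offset.
   sqlite3_blob* blob = nullptr;
   if (sqlite3_blob_open(db, "main", "project", "doc", rowId,
                         /*writable*/ 1, &blob) != SQLITE_OK)
      return false;
   int offset = 0;
   for (const auto& [data, size] : chunks)
   {
      if (sqlite3_blob_write(blob, data, int(size), offset) != SQLITE_OK)
      {
         sqlite3_blob_close(blob);
         return false;
      }
      offset += int(size);
   }
   return sqlite3_blob_close(blob) == SQLITE_OK;
}
```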

Collaborator

@Paul-Licameli Paul-Licameli left a comment

Comment typos

{
constexpr size_t bufferSize = std::numeric_limits<double>::max_digits10 +
5 + // No constexpr log2 yet!
3; // Dot, sing an 0 separator
Collaborator

Sign

WriteEscaped(value);
}

void XMLUtf8BufferWriter::WriteSubTree(const std::string_view& value)
Collaborator

I think this isn’t used

Contributor Author

I've just copied the "Writer" interface, but I agree that this can be dropped.

@Paul-Licameli
Collaborator

Approved, let’s move

@crsib crsib force-pushed the 2051_deserialization_performance branch from a81eac4 to c710ea0 Compare November 9, 2021 11:29
@crsib crsib merged commit f89b428 into audacity:release-3.1.1 Nov 9, 2021
@crsib crsib deleted the 2051_deserialization_performance branch November 9, 2021 11:38
@Paul-Licameli
Collaborator

Paul-Licameli commented Nov 9, 2021

#2087 (comment)

But we can know the blob size before we write it?

We still capture data in chunks and move it into the blob, but we could eliminate the intermediate move into one big contiguous array. Correct? That would be a win.

You were saying we might eliminate more moves if we could resize the blob as we write it? Yeah, no luck there, no additional win.

@crsib
Contributor Author

crsib commented Nov 9, 2021

But we can know the blob size before we write it?
We still capture data in chunks and move it into the blob

That is what I plan to do in the next PR.

@crsib crsib mentioned this pull request Nov 11, 2021