
Deserialization performance #2087

Merged

Conversation

Contributor

@crsib crsib commented Nov 3, 2021

A first attempt to improve the loading times for large projects. This improves both performance and memory usage by around 2x.

A 46 MB binary produces a 75 MB XML stream, which requires around 90 MB of RAM and 5 seconds to load. Previously, the RAM overhead was ~380 MB and the loading time was 11 seconds.

Peak memory usage is now 221 MB, down from ~580 MB. Audacity uses 134 MB right after the project is loaded.

XML processing is now responsible for 78% of that time:
(profiler screenshot)

Half of the time is spent in "StartElement"
(profiler screenshot)

26% of the time is spent on constructing wxString objects.
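Much of that wxString cost comes from materializing an owning string for every tag and attribute. A minimal sketch of the alternative this PR moves toward, with hypothetical tag names and a hypothetical consumer (OnStartElement and ParseOffset are illustrative, not the actual Audacity callbacks):

```cpp
#include <string_view>
#include <utility>
#include <vector>

using Attribute = std::pair<std::string_view, std::string_view>;

// Hypothetical consumer; a real one would parse the number in place.
void ParseOffset(std::string_view value) { (void)value; }

// SAX-style handler: names and values stay as views into the parser's
// UTF-8 buffer, so no wxString (and no heap allocation) is constructed
// for elements that are merely inspected.
void OnStartElement(std::string_view tag, const std::vector<Attribute>& attrs)
{
   if (tag != "waveclip") // hypothetical tag name
      return;
   for (const auto& [name, value] : attrs)
      if (name == "offset")
         ParseOffset(value);
}
```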

  • I signed CLA
  • The title of the pull request describes an issue it addresses
  • If changes are extensive, then there is a sequence of easily reviewable commits
  • Each commit's message describes its purpose and effects
  • There are no behavior changes unnecessary for the stated purpose of the PR

Recommended:

  • Each commit compiles and runs on my machine without known undesirable changes of behavior

/*!
* @brief A low overhead memory stream with O(1) append, low heap fragmentation and a linear memory view.
*
* wxMemoryBuffer always appends 1Kb to the end of the buffer, causing severe performance issues
* and significant heap fragmentation. There is no possibility to control the increment value.
*
* std::vector doubles it's memory size which can be problematic for large projects as well.
Collaborator

its, not it's

using StreamChunk = std::pair<const void*, size_t>;

private:
static constexpr size_t ChunkSize = 1024 * 1024;
Collaborator

I wonder if there would be less waste of operating system pages if you reduced this constant by 2 * sizeof(void*)? That is, sufficient space for the overhead of a std::list node in the probable implementation. I'm guessing the operating system is otherwise allocating a few big pages only to use 16 bytes of the last one for the end of the chunk.

Contributor Author

Yeah, makes total sense, thank you!

Collaborator

Actually it should be more than two pointers, for operator new overhead. Just how much would of course vary with platform and debug configuration.
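For concreteness, a minimal sketch of the scheme under discussion (illustrative names, not the PR's actual MemoryStream class; the operator-new overhead term is a guess, as noted above):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <list>

class ChunkedStream
{
   static constexpr size_t NewOverhead = 2 * sizeof(void*); // a guess
   static constexpr size_t ChunkSize =
      1024 * 1024            // target ~1 MB per node
      - 2 * sizeof(void*)    // std::list prev/next pointers
      - sizeof(size_t)       // per-chunk "bytes used" member
      - NewOverhead;         // allocator header, platform dependent

   struct Chunk
   {
      size_t Used = 0;
      char Data[ChunkSize];
   };

   std::list<Chunk> mChunks;

public:
   void AppendData(const void* data, size_t size)
   {
      auto ptr = static_cast<const char*>(data);
      while (size > 0)
      {
         if (mChunks.empty() || mChunks.back().Used == ChunkSize)
            mChunks.emplace_back(); // O(1) append, no reallocation
         auto& chunk = mChunks.back();
         const size_t count = std::min(size, ChunkSize - chunk.Used);
         std::memcpy(chunk.Data + chunk.Used, ptr, count);
         chunk.Used += count;
         ptr += count;
         size -= count;
      }
   }
};
```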

@@ -30,7 +30,10 @@ the general functionality for creating XML in UTF8 encoding.
 #include <wx/ffile.h>
 #include <wx/intl.h>
 
-#include <string.h>
+#include <cstring>
+#include <charconv>
Collaborator

As you now know, <charconv> is incomplete on macOS and not even present in the Linux build. Sorry, you must find an alternative to to_chars.

Contributor Author

I'm painfully aware of it now. It is really sad that such a simple part of C++17 is still not fully implemented in 2021.
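One possible shape of the workaround, sketched here (assuming the feature-test-macro route; not necessarily what the PR ended up doing):

```cpp
#include <cstdio>
#include <limits>
#include <string>
#if defined(__has_include)
#  if __has_include(<charconv>)
#    include <charconv>
#  endif
#endif

std::string DoubleToString(double value)
{
   char buffer[40]; // sign + 17 digits + dot + "e-308" + NUL fits easily
#if defined(__cpp_lib_to_chars)
   // Fast path where <charconv> actually implements floating point.
   auto [end, ec] = std::to_chars(buffer, buffer + sizeof(buffer), value);
   if (ec == std::errc())
      return std::string(buffer, end);
#endif
   // Portable fallback: max_digits10 guarantees the value round-trips.
   std::snprintf(buffer, sizeof(buffer), "%.*g",
                 std::numeric_limits<double>::max_digits10, value);
   return buffer;
}
```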

@crsib crsib force-pushed the 2051_deserialization_performance branch 2 times, most recently from 41e23c9 to 9d2a064 Compare November 8, 2021 15:04
@@ -394,45 +395,45 @@ wxString ProjectSerializer::Decode(const wxMemoryBuffer &buffer)
 mIds.clear();
 
 struct Error{}; // exception type for short-range try/catch
-auto Lookup = [&mIds]( UShort id ) -> const wxString &
+auto Lookup = [&mIds]( UShort id ) -> std::string_view
Collaborator

Look below to the FT_Push and FT_Pop cases. I think there is an opportunity to use move assignment or swaps for a bit more performance, instead of copying mIds.

Collaborator

FT_Push and FT_Pop never happen, in fact!
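The reviewer's suggestion, sketched with hypothetical types (and moot here, since FT_Push/FT_Pop never occur in practice): move the id dictionary instead of copying it when a scope is pushed or popped.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct Decoder
{
   using IdMap = std::unordered_map<unsigned short, std::string>;
   IdMap mIds;
   std::vector<IdMap> mIdStack;

   void OnPush()
   {
      mIdStack.push_back(std::move(mIds)); // was a copy: push_back(mIds)
      mIds.clear(); // moved-from map is valid but unspecified; reset it
   }

   void OnPop()
   {
      mIds = std::move(mIdStack.back());
      mIdStack.pop_back();
   }
};
```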

static constexpr size_t ChunkSize =
1024 * 1024 - // 1Mb
2 * sizeof(void*) - // account for the list node pointers
sizeof(size_t); // account for the bytes used member
Collaborator

Maybe subtract a little more for operator new overhead. Make a guess what that is.

Contributor Author

I saw no measurable difference here, to be honest. If the CRT decides to allocate a page-size-aligned amount of memory, we will have an overhead of 4K (less than 1% in this case) and will likely be able to reuse it. If the CRT falls back to some mmap-like API, there is no overhead that we can control here. Controlling the overhead from std::allocator is more feasible, but I'm not sure we really need such control. (And the simplest way is to simply use a handwritten allocator.)

Anyway, the current bottleneck is very far away from this place, both for serialization and deserialization. A better fix would be to perform incremental I/O directly from the database, but the SQLite3 interface doesn't really have an incremental write API. We can avoid memory "linearization", though, because sqlite3_bind_zeroblob is a very fast API (it does not actually commit any pages, rather just sets a proper header).

Collaborator

Shrug. Whatever easy little help we can get. Yes, measure first.

@crsib crsib marked this pull request as ready for review November 9, 2021 09:24
@crsib crsib force-pushed the 2051_deserialization_performance branch from 9d2a064 to cb670da Compare November 9, 2021 09:34
@Paul-Licameli
Collaborator

I have reviewed everything but the XMLUtf8BufferWriter class and have no objections to those parts.

I assume the ToChars implementation is just a great cut and paste of trusted code. Or are there any differences from your source I should know about?

@crsib
Contributor Author

crsib commented Nov 9, 2021

I assume the ToChars implementation is just a great cut and paste of trusted code. Or are there any differences from your source I should know about?

The only major difference is that I have added buffer size checks to Grisu2, which always blindly assumed that "the buffer is big enough".
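The sort of check being described, sketched in isolation (the real change lives inside the imported Grisu2 code; names here are illustrative):

```cpp
#include <cstddef>

// Append one character only if it fits; the original Grisu2 code wrote
// unconditionally, assuming "the buffer is big enough".
bool AppendChar(char* buffer, size_t capacity, size_t& length, char c)
{
   if (length >= capacity)
      return false; // would overflow: report failure instead of writing
   buffer[length++] = c;
   return true;
}
```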

@Paul-Licameli
Collaborator

You don't yet use sqlite3_bind_zeroblob. Do you mean to?

This would allow skipping MemoryStream::GetData(), which is one huge allocation and memory movement?

Why is the blob handle interface "not really" the incremental write we need?

Are there other opportunities for performance improvement you have seen but not yet implemented in this PR?

{
constexpr size_t bufferSize = std::numeric_limits<float>::max_digits10 +
5 + // No constexpr log2 yet! example - e-308
3; // Dot, sing an 0 separator
Collaborator

Sign

Contributor Author

Why is the blob handle interface "not really" the incremental write we need?

SQLite needs to know the blob size in advance, so there is no way to stream data into the blob as you can into a file, for example. However, it is possible to reuse the "iterators" interface and drop the GetData call.
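A sketch of that follow-up idea, for the record (hypothetical table and column names; not code from this PR): the total size is computable from the chunk list, so the blob can be reserved with sqlite3_bind_zeroblob and then filled chunk by chunk through the incremental blob API, with no linearized copy.

```cpp
#include <sqlite3.h>
#include <list>
#include <utility>

using StreamChunk = std::pair<const void*, size_t>;

bool WriteChunks(sqlite3* db, sqlite3_int64 rowId,
                 const std::list<StreamChunk>& chunks)
{
   sqlite3_int64 total = 0;
   for (const auto& [data, size] : chunks)
      total += size;

   // Reserve the blob: zeroblob just records the size, which is cheap.
   sqlite3_stmt* stmt = nullptr;
   if (sqlite3_prepare_v2(db, "UPDATE project SET doc = ?1 WHERE id = ?2;",
                          -1, &stmt, nullptr) != SQLITE_OK)
      return false;
   sqlite3_bind_zeroblob64(stmt, 1, total);
   sqlite3_bind_int64(stmt, 2, rowId);
   const bool reserved = sqlite3_step(stmt) == SQLITE_DONE;
   sqlite3_finalize(stmt);
   if (!reserved)
      return false;

   // Stream each chunk into the reserved blob at its running offset.
   sqlite3_blob* blob = nullptr;
   if (sqlite3_blob_open(db, "main", "project", "doc", rowId,
                         /*writable*/ 1, &blob) != SQLITE_OK)
      return false;
   int offset = 0;
   for (const auto& [data, size] : chunks)
   {
      if (sqlite3_blob_write(blob, data, int(size), offset) != SQLITE_OK)
      {
         sqlite3_blob_close(blob);
         return false;
      }
      offset += int(size);
   }
   return sqlite3_blob_close(blob) == SQLITE_OK;
}
```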

Collaborator

@Paul-Licameli Paul-Licameli left a comment

Comment typos

{
constexpr size_t bufferSize = std::numeric_limits<double>::max_digits10 +
5 + // No constexpr log2 yet!
3; // Dot, sing an 0 separator
Collaborator

Sign

WriteEscaped(value);
}

void XMLUtf8BufferWriter::WriteSubTree(const std::string_view& value)
Collaborator

I think this isn’t used

Contributor Author

I've just copied the "Writer" interface, but I agree that this can be dropped.

@Paul-Licameli
Collaborator

Approved, let’s move

@crsib crsib force-pushed the 2051_deserialization_performance branch from a81eac4 to c710ea0 Compare November 9, 2021 11:29
@crsib crsib merged commit f89b428 into audacity:release-3.1.1 Nov 9, 2021
@crsib crsib deleted the 2051_deserialization_performance branch November 9, 2021 11:38
@Paul-Licameli
Collaborator

Paul-Licameli commented Nov 9, 2021

#2087 (comment)

But we can know the blob size before we write it?

We still capture data in chunks and move it into the blob, but we could eliminate the intermediate move into one big contiguous array. Correct? That would be a win.

You were saying we might eliminate more moves if we could resize the blob as we write it? Yeah, no luck there, no additional win.

@crsib
Contributor Author

crsib commented Nov 9, 2021

But we can know the blob size before we write it?
We still capture data in chunks and move it into the blob

That is what I plan to do in the next PR.

@crsib crsib mentioned this pull request Nov 11, 2021