New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deserialization cleanup #2121
Deserialization cleanup #2121
Conversation
|
|
||
| bool result = HandleXMLTag(UTF8CTOWX(tag), out_attrs.get()); | ||
| bool result = HandleXMLTag(tag, mCurrentTagAttributes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By eliminating UTF8CTOWX, are you requring that tag strings to be canonicalized into utf-8 already, or requiring them just to be ASCII strings? They are all ASCII in practice (I just had to review them all). It is worth a comment.
You know I like to say things with type disctinctions where I can. If we were using C++20, I would use std::u8string_view instead of std::string_view. Not having C++20 and char8_t, what can we do? Maybe at least make a different type alias to make the disctinction in function declarations, thought we won't have type checking by the compiler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't actually think that char8_t implies any checks on encoding. However, we use Expat, and Expat seems to convert the output to UTF8 internally. So specifically for this case, I imply UTF8, and that is what originally implied.
The UTF8 allows to safely compare to ASCII strings given that all the symbols are from the first half of the table. This covers Audacity needs for 100% when working with tags and names. For attribute values - we explicitly convert from UTF8 to wxString where needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly it's safe when a tag is ASCII, and all of them are. But I'm asking, are we requiring them to be ASCII now that you remove UTF3CTOWX ?
You may miss my point about char8_t -- it is that, where a string or string_view is required to be utf8 encoding, it would be good to say that with one type, and where required to be ASCII, to say that without another type, and it would be best if the types do not implicitly interconvert, so the compiler will help us catch misuses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we requiring them to be ASCII
We can't even do it, because Expat will always output UTF8 encoded string.
where a string or string_view is required to be utf8 encoding
The safest assumption is that all std::string are UTF8 encoded to be fair.
|
|
||
| bool result = HandleXMLTag(UTF8CTOWX(tag), out_attrs.get()); | ||
| bool result = HandleXMLTag(tag, mCurrentTagAttributes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By eliminating UTF8CTOWX, are you requring that tag strings to be canonicalized into utf-8 already, or requiring them just to be ASCII strings? They are all ASCII in practice (I just had to review them all). It is worth a comment.
You know I like to say things with type disctinctions where I can. If we were using C++20, I would use std::u8string_view instead of std::string_view. Not having C++20 and char8_t, what can we do? Maybe at least make a different type alias to make the disctinction in function declarations, thought we won't have type checking by the compiler.
|
I pushed a suggested commit, please review that! |
| @@ -0,0 +1,2947 @@ | |||
| // fast_float by Daniel Lemire | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm taking this whole file on trust. Are there any differences from the original I should know? As there were with ToChars?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This time it is just the latest release of the library!
| //! Result of the conversion, similar to std::from_chars_result | ||
| struct STRING_UTILS_API FromCharsResult final | ||
| { | ||
| const char* ptr; //! A pointer to the first character no matching the the pattern |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"not matching the pattern"
| TagTable mTagTable; | ||
| std::forward_list<std::string> mTags; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably std::deque? It can be more efficient on some platforms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought deque deals with big pages. Existing ProjectFileIORegistry (alias XMLMethodRegistry) registrations are maybe too few to justify it, and they happen only at startup.
Whereas, the lookups into the map may be many, and we want all the savings we can find during deserialization.
The lookups may not really be many, yet, but XMLMethodRegistry may be reused for many more fields later.
I've left that out to see if I can see a measurable difference in performance and I've observed none. Still, that commit is a welcome addition, thank you! |
| FromChars( | ||
| valueString.data(), valueString.data() + valueString.length(), | ||
| value) | ||
| .ec; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it OK to ignore .ptr in the FromCharsResult? (This is the only place you do ignore it)
|
Something is definitly wrong with autosave now. Do these steps:
Expected result: the second recovery works. I think you must recover twice to get the bug, and I think it happens only starting at the commit "Use the fast stream reader (still, reading from memory)". Of course it may be hard to be sure, when there is a memory overwrite. |
| } | ||
| } | ||
|
|
||
| mStringsCache.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mStringsCache is a deque, I assume because you need non-relocation of the strings that you cache in it (so that you can make string_views later).
But could repeated clearing of the deque and then re-allocation in it doing some wasteful reallocations? Are all pages freed when you clear()? Standard says std::vector::clear() does not change capacity, but I don't see the same stipulated for deque.
Maybe you can keep one dummy string in it, and resize to 1 instead of clearing, if it matters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You really don't like deque ;-)
Usually, the problem with deque is considered to be different: the memory costs just for one item are very high because deque allocates in large chunks. std::sting is usually around 24-32 bytes (why, libc++?. No, really, why 32 bytes?). On Clang deque allocates no less the 4k, on GCC - 512 bytes, which surely fits quite a bit of strings in. Even horrible MSVC implementation is not a performance concern here. We have worse problems to be solved first, for example, Sequence::HandleXMLTag is currently the #1 issue, because it never estimates the resulting blocks count, so mBlock.push_back(wb); is consuming most of the time right now. Still, we have a 40x speed up, and that problem can be easily addressed in a separate, much smaller, and easier to review PR.
In case we ever end up having deque to be a performance problem here - the fix will be as easy as a custom allocator. And trust me - you have to write them a lot when you are fighting for the performance :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Custom allocator. I was wondering if you would find use for std::pmr::string and some other allocator for the character data.
| { return static_cast<std::make_unsigned_t<BaseCharType>>(c) < 0x7f; }); | ||
|
|
||
| if (isAscii) | ||
| return std::string(begin, end); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks correct, but I don't understand what economy it gives you. Still there is one std::string construction. Are you sure there are in fact even more allocations in the non-ascii path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Around 15-20 percent speedup. C++ standard library fails horribly when it comes to working with strings and locales (nothing new, really. It was this way since the STL was created). Another approach would be to use alternative converters. It will be a bit slower in general, but much faster if there are lots of non ASCII labels in the project.
There was an issue with recovering a project, that never was saved. I do not observe the issue you described though |
Which issue? If it happens, it surely is an issue to fix. The unsaved data might be a long recording.
I know I saw this problem, and I find it reliably in my macOS debug build of this branch, though the exact symptoms are various. I'm not satisfied until there is more understanding. |
Your latest change to LoadProject may have made the problem go away, so it seems with a few tries, although I don't know why. |
|
I do not observe any problems with the latest build on macOS, not in release or debug builds. I think there can be an issue in ShowError |
|
class So should the class be deleted? Sad all that hard work obsoleted so soon. |
|
I just need to step through ProjectFileIO::LoadProject to be sure I understand all the changes in there, and I will approve if I think nothing is wrong, although the unexplained random problems I saw give me a bit of doubt that some problem was missed. |
To be fair, I think it's XMLStringWriter that will go away at some point, and XMLWriter interface will be adapted to match XMLUtf8BufferWriter. There is still room for performance gains in serialization, that can be achieved by removing |
For floats - excellent MIT licensed implementation by prof. D. Lemire is used. For integers - some template magic is used for universal, overflow checking implementation.
XMLAttributeValueView is a view class that holds possible attribute values. XMLAttributeValueView does not take ownership for the string values, so it should never be stored.
|
Well, that was epic! |
The idea behind the PR:
string_viewto represent tag and attribute namesThe infamous project I mentioned in #2087 loads in:
Additionally, I have a project from @Penikov, which has a 420 Mb project blob. For this case, I observe:
The dark side of this PR:
Recommended: