
Inconsistency between the interface of vector bindings and python list & potential wild pointer problems. #32

Open
Yikai-Liao opened this issue Feb 29, 2024 · 53 comments

@Yikai-Liao
Owner

Following #28, I'm considering adding an inplace argument to other functions. But I've found that in-place operation does introduce some inconsistency between the interface of the vector bindings and Python lists. It might also lead to wild (dangling) pointers.

An Experiment

Here is an experiment I conducted:
[screenshot: output of the experiment]

Theoretical Analysis

Essentially, a python list holds an array of pointers (pointing to the real data), while the std::vector holds an array of the real data. Consider the following code:

notes = [Note(i, 0, 0, 0) for i in range(10)]
note = notes[-1]
notes.insert(0, Note(0, 0, 0, 0))
assert note == Note(9, 0, 0, 0)

For a Python list, we always expect note (a reference to what was the last element of that list) to stay Note(9, 0, 0, 0) no matter how we insert or remove elements in the list, since such operations only move some pointers and never change the position of any real data.

notes = NoteTickList([Note(i, 0, 0, 0) for i in range(10)])
note = notes[-1]
notes.insert(0, Note(0, 0, 0, 0))
assert note == Note(8, 0, 0, 0)

For the vector bindings, e.g. NoteTickList, we do move the real data when inserting the new note, while the pointer held by note remains the same. Normally (if we don't exceed the vector's capacity), note will still point to the 10th element in notes, which is now Note(8, 0, 0, 0). So this is an inconsistency between the bindings and Python lists.

And if we exceed the capacity, the vector will automatically copy all the data into a new, larger block of memory and free the original. Then note becomes a "wild" (dangling) pointer into freed memory. With enough bad luck, this will make your program trigger a segmentation fault.

In conclusion, even if such inconsistency is acceptable, any operation that moves the real data can leave behind a potential "wild pointer".

Possible Solution

@Natooz I'm quite concerned about this issue and no perfect solution has been found.

  • One possible solution is to disable all operations that change the location of the data and replace them with copying operations, like numpy's append. And as we all know, numpy.append is slow and something to be avoided.
  • Another is to replace vector<Note> with vector<shared_ptr<Note>>. This makes the C++ vector behave like a Python list, but introduces a lot of memory fragmentation and overhead.

Both of these schemes introduce a significant overhead, and neither looks elegant or pythonic.

Of course, we could write in the documentation and the README to tell the user not to hold a reference to an event for a long time, but to copy it manually.

But I think there are still a lot of people who don't read these instructions, and then run into the problems I've analyzed.

Improving performance always comes at a price, which can take various forms.

@Natooz
Contributor

Natooz commented Feb 29, 2024

Of course, we could write in the documentation and the README to tell the user not to hold a reference to an event for a long time, but to copy it manually.

But I think there are still a lot of people who don't read these instructions, and then run into the problems I've analyzed.

I had the exact same thoughts just before reading this passage.
But it might be the best option in the meantime before moving to changes?

@Yikai-Liao
Owner Author

I'll add the notice as soon as possible.

@leleogere, Sorry to interrupt, but I wanted to ask what tool you used to create your previous diagram. It would be easier to understand if I could draw a schematic of the memory layout

@leleogere

leleogere commented Mar 1, 2024

@leleogere, Sorry to interrupt, but I wanted to ask what tool you used to create your previous diagram. It would be easier to understand if I could draw a schematic of the memory layout

I did them manually in Inkscape (an open source vector drawing app). It is quite powerful, but the learning curve might be a bit steep for beginners.

About the topic of this issue: I don't know a lot of C++, but personally I would expect a Python library to stick as closely as possible to Python behavior. Asking people "not to hold a reference to an event for a long time" is kinda vague and might confuse them even more. I'm not sure what the best solution is, but I think I would rather take a (small?) performance hit than get a mysterious segmentation fault, potentially happening dozens of lines after the real origin of the problem.

@Yikai-Liao
Owner Author

@leleogere Thanks. I agree with you.

I'm now discussing with the author of nanobind. wjakob/nanobind#432

I've come up with a possible solution that has a smaller overhead. I'll be refining this idea these days.

@Yikai-Liao
Owner Author

@leleogere Thanks. I agree with you.

I'm now discussing with the author of nanobind. wjakob/nanobind#432

I've come up with a possible solution that has a smaller overhead. I'll be refining this idea these days.

As I was nearing the end of writing the container I had envisioned, I realized that this solution still didn't guarantee semantic alignment with Python lists. I ended up choosing to use a large number of shared ptrs to solve this problem.

Of course, this will have a relatively large impact on performance, and I will test this further once the changes are complete.

@Yikai-Liao
Owner Author

@Natooz The switch to shared_ptr is still some way off, but I can already predict its overhead: roughly 3x, in both time and space.

This may be acceptable, though I'm not sure. But the overhead makes me uncomfortable (I admit, I'm a bit of a perfectionist), since most of the time overhead comes from frequently allocating single small objects, like a note. Theoretically this should be avoidable, but I just haven't found a way.

But perhaps we can solve this problem with interface design instead of using shared_ptr.

All the problems come from references. So if we copy an object in the list each time we access it, then all the problems would be solved. But this way makes it hard for users to modify a list. One possible way is to design an interface like the one in DataFrame.

# There are just a few tracks commonly, so we could use shared_ptr for tracks.
# And then, we could access a track through a reference
notes = s.tracks[0].notes  # reference, no copy

note = notes[1]           # copy
notes[1].pitch += 12      # the original data is not modified because of copy

# new interface
notes[1, "pitch"] += 12   # the original data is now modified

Admittedly this approach is a bit counter-intuitive, but it is currently the most "efficient" way to deal with the dangling pointer problem.

What's your opinion about the technological route?

@Natooz
Contributor

Natooz commented Mar 31, 2024

Hi @Yikai-Liao , thank you for keeping me updated! 😃
I'll be busy this weekend; I'll reread everything tomorrow!

Yikai-Liao self-assigned this Apr 1, 2024
@Natooz
Contributor

Natooz commented Apr 1, 2024

I understand your "frustration" here as I'm also a perfectionist, especially when it comes to basic operations used for many downstream tasks.

First, as the issue here comes from the reference inconsistency between Python lists and vector bindings, it would be too bad to add overhead for all operations by switching everything to shared_ptr.

I think copying the data instead of returning a reference might be a preferable route; it would probably add less overhead overall while remaining intuitive.
We could still return references when iterating over the vector bindings (for note in track.notes), as this keeps the overhead low, especially when the list is long, and there would be no impact on the references.

For the DataFrame-like interface, I'm not sure it would be a good idea to implement it without a good reason, as it would not be pythonic. Switching to such an interface might confuse users, and would require extra documentation and user guidance.

@Yikai-Liao
Owner Author

Yikai-Liao commented Apr 12, 2024

I finally finished writing a container, pyvec, that will be used to solve the lifetime problem.

Luckily, after testing, the overhead introduced by this container is much smaller compared to shared_ptr, in many cases an order of magnitude smaller.

This means that after considering the overhead associated with python bindings, this overhead is largely negligible.

[benchmark chart: overhead of pyvec vs. shared_ptr]

Even when using only C++, I think this overhead is acceptable. Here are the VTune test results for MIDI parsing, where the to_shared function accounts for all of the introduced overhead, which is small compared to parsing the MIDI itself.

[VTune profiling screenshot]

@Yikai-Liao
Owner Author

Yikai-Liao commented Apr 21, 2024

🥳 I have almost finished the new version. The shared branch can now be installed.

But since I have completely rewritten the binding code, I'm not sure whether I missed some functions that existed in the old version. Further tests are needed.

Edit:

Well, first I need to make it compile on all platforms 😅.

@Yikai-Liao
Owner Author

@Natooz The tests in symusic's GitHub Actions have passed 🥳. If you have time, could you test whether the new version works correctly with MidiTok?

@Natooz
Contributor

Natooz commented Apr 22, 2024

Wonderful! 🎉
I'm running the tests; the first blocking issue comes, I believe, from the from_numpy binding.

@Yikai-Liao
Owner Author

Wonderful! 🎉 I'm running the tests; the first blocking issue comes, I believe, from the from_numpy binding.

It's fixed now

@Natooz
Contributor

Natooz commented Apr 22, 2024

Probably the same problem with the clip method
AttributeError: 'ScoreTick' object has no attribute 'clip'

And I also encounter an index error when trying to add an element to the tempos of a Score when there is none
I assume the same thing would happen for time signatures

@Yikai-Liao
Owner Author

Probably the same problem with the clip method AttributeError: 'ScoreTick' object has no attribute 'clip'

And I also encounter an index error when trying to add an element to the tempos of a Score when there is none. I assume the same thing would happen for time signatures.

Thanks for your report. These two bugs have been fixed.

@Natooz
Contributor

Natooz commented Apr 22, 2024

Thank you for being so reactive!

Next bug that I got here with extend on markers.

Also, with the new version, are there some cases where a Score does not have any default time signature? Meaning score.time_signatures is just empty.

@Yikai-Liao
Owner Author

Thank you for being so reactive!

Next bug that I got here with extend on markers.

Also, with the new version, are there some cases where a Score does not have any default time signature? Meaning score.time_signatures is just empty.

Sorry, it looks like I missed the special handling of some empty cases in pyvec.

For the second question, I remember that I didn't add default values for tempos in older versions either.

@Natooz
Contributor

Natooz commented Apr 22, 2024

Still a few things not passing :)

Beginning with the creation of Tempo: the time attribute always ends up as 0, whatever value is provided as an argument.

Like with markers, the extend binding seems broken for tempos. Same for lyrics and key signatures; I assume the other containers should be checked too (time signatures...).

@Yikai-Liao
Owner Author

Yikai-Liao commented May 14, 2024

@Natooz Sorry for the late reply. I've been busy with my graduation project.

The new version of symusic now passes all the tests in MidiTok. However, this involves a tweak to the copy function call. For now I have set up the interface semantics as follows:

function       old semantic                      new semantic
copy           deep copy                         shallow copy
__copy__       deep copy                         shallow copy
deepcopy       (did not exist)                   deep copy
__deepcopy__   implementation error (unusable)   deep copy

I noticed that in MidiTok's test code, the copy.copy function is used to make a deep copy of the Score. For newer versions, this should be changed to use copy.deepcopy or the member function deepcopy.
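For example, under the new semantics (a minimal sketch; the file path is just a placeholder):

import copy
from symusic import Score

score = Score("song.mid")    # placeholder path

shallow = copy.copy(score)   # new semantics: shallow copy, same as score.copy()
deep = copy.deepcopy(score)  # new semantics: deep copy, same as score.deepcopy()

# A shallow copy shares the underlying containers, so edits are visible through
# both objects; a deep copy is fully independent.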

However, because of a mistake in my previous implementation (wrong definition of the __deepcopy__ function's entry parameter), there is not a consistent and usable deep copy interface between the old and new versions.

Therefore, I'm not sure this interface change I'm introducing makes sense.

@Yikai-Liao
Owner Author

Just to be on the safe side, I think it's time for me to introduce coverage tests in this new release (although it's long overdue). For the time being, I think it's okay to release this as a beta version.

I also need to take the time to write documentation for the whole API and discuss the rationale for the default arguments.

@Natooz
Contributor

Natooz commented May 14, 2024

Wonderful! Thank you for the heads-up! I think it makes total sense to bind the methods following the same principles. I'll make sure to update the methods in MidiTok.

For the tests, I intend to transfer some of MidiTok's directly here; I'm especially thinking of Score splits and concatenations. Would you like me to do it?

@Yikai-Liao
Owner Author

 I intended to transfer some MidiTok's directly here, I'm especially thinking of Score splits and concatenations.

@Natooz It would be great if you have time to do so.

And I will first add some tasks on the C++ side. Do you think we need to separate the Python test files and the C++ test files into two subfolders?

@Natooz
Contributor

Natooz commented May 15, 2024

I created a draft PR to add more tests. Right now I've just added a few; I'll add more and make sure everything runs smoothly when I have more time later.

For the C++ tests, I don't know what the best practice is in this case, but I assume separating them into different directories wouldn't be a bad idea: tests/python, tests/cpp.
Ideally they should be run by the tests action along with the Python tests. I don't know which C++ test tool, and which GitHub action to run it, should be employed here, but that shouldn't be hard to figure out. I can take a look later too.

@Yikai-Liao
Owner Author

Yikai-Liao commented May 22, 2024

@Natooz I have merged shared into the main branch. I also successfully moved the symusic folder into python/symusic, so now you can run pytest locally without removing the symusic folder.

I changed some copy calls in the test files to deepcopy. But there are still 4 test cases failing. I don't quite understand what's wrong, so I'm asking for help here.

@Natooz
Contributor

Natooz commented May 22, 2024

Wonderful!
About the tests, I think they are not passing because they weren't finished :)
Is it urgent? I can take the time to finish them and make sure it's working as it should, however I have a limited amount of time right now so I must prioritize, but I can make this effort if necessary!

@Yikai-Liao
Owner Author

It's not urgent. Just deal with it when you have time.

@Natooz
Contributor

Natooz commented May 30, 2024

I'll take care of the tests tomorrow! 🙌

@Natooz
Contributor

Natooz commented May 31, 2024

The few tests that don't pass are due to some error in the equality assertion of ControlChangeTickList.

@pytest.mark.parametrize("file_path", MIDI_PATHS_MULTITRACK)
def test_split_concat_score(file_path: Path):
    score = Score(file_path)
    if score.end() == 0:
        pytest.skip("empty file")
    tick_split = score.tpq * 4
    score_1 = score.clip(0, tick_split, clip_end=False)
    score_2 = score.clip(tick_split, score.end(), clip_end=True).shift_time(-tick_split)

    # Concatenate split MIDIs and assert its equal to original one
    score_concat = concat_scores((score_1, score_2), [tick_split])

    # Assert the concatenated Score is identical to the original one
    # We do not test tempos, time signatures and key signature as they are duplicated
    # in score_concat (same consecutive ones for each chunk).
    t1 = score.tracks
    t2 = score_concat.tracks
    tt = [(tt1, tt2) for tt1, tt2 in zip(t1, t2) if tt1 != tt2]
    for tt1, tt2 in tt:
        neq = tt1.notes == tt2.notes
        ceq = tt1.controls == tt2.controls  # Can be False
        peq = tt1.pedals == tt2.pedals
        pbeq = tt1.pitch_bends == tt2.pitch_bends
        names_eq = tt1.name == tt2.name
        programs_eq = tt1.program == tt2.program

        n1 = [n for n in tt1.notes]
        n2 = [n for n in tt2.notes]
        nn = [(nn1, nn2) for nn1, nn2 in zip(n1, n2) if nn1 != nn2]
        cc = [(cc1, cc2) for cc1, cc2 in zip(tt1.controls, tt2.controls) if cc1 != cc2]  # Are all equal when compared individually
    assert score.tracks == score_concat.tracks
    assert score.lyrics == score_concat.lyrics
    assert score.markers == score_concat.markers

I tried to look into the stubs for any potential cause of this without success. Do you know what could cause this error?

Also I made the lints pass, will open a PR soon.

@Yikai-Liao
Owner Author

@Natooz Maybe tt1.controls and tt2.controls have different lengths? The zip function ignores anything beyond the shorter of the two.
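A quick way to see (and catch) the truncation, since zip silently stops at the shorter input (strict=True needs Python 3.10+):

a = [1, 2, 3]
b = [1, 2, 3, 4]

# The extra element of b is never compared, so the mismatch list is empty
assert [(x, y) for x, y in zip(a, b) if x != y] == []

# On Python 3.10+, strict=True raises ValueError when the lengths differ
# list(zip(a, b, strict=True))  # -> ValueError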

By the way, I think copy(deep=True) is better than copy and deepcopy, just like pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html#pandas.DataFrame.copy
What's your opinion?

@Natooz
Contributor

Natooz commented Jun 2, 2024

Indeed they had different lengths, thank you for the hint.
I managed to fix the test (and make them pass), I misunderstood the effect of clip_end.

By the way, I think copy(deep=True) is better than copy and deepcopy

I find it more convenient and less redundant too!

@wrongbad

I know I'm pretty late to the thread, but why not use std::shared_ptr<std::vector<Note>> for the underlying data and then for each python Note object hold std::shared_ptr<std::vector<Note>> + int idx.

This way you have stable references, and pythonic shared-ptr behaviors without the complexity of custom data structures
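If it helps, here is that idea expressed as a toy Python model (NoteProxy is hypothetical, not symusic's API): the proxy never stores an element address, only the owner and a position, so growing or reallocating the underlying storage cannot leave it dangling.

class NoteProxy:
    def __init__(self, owner, idx):
        self.owner = owner  # keeps the container alive (the shared_ptr role)
        self.idx = idx      # position inside the container

    def get(self):
        return self.owner[self.idx]  # resolved lazily on every access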

@Yikai-Liao
Owner Author

I know I'm pretty late to the thread, but why not use std::shared_ptr<std::vector<Note>> for the underlying data and then for each python Note object hold std::shared_ptr<std::vector<Note>> + int idx.

This way you have stable references, and pythonic shared-ptr behaviors without the complexity of custom data structures

In fact, I tried this option and almost finished the code. But this scheme introduces a big performance overhead, probably because shared pointer contains a lot of atomic operations to ensure thread safety.

For now, I'm using my own pyvec implementation to store events like notes, and shared_ptr<vector<shared_ptr>> to store tracks. This prevents tracks from being deep-copied when appending.

@wrongbad

wrongbad commented Jun 11, 2024

Which operations were causing performance overheads? This should only penalize you when creating a lot of python proxy objects, and I would expect the penalty would be dwarfed by the cpython overhead of creating a lot of objects anyway.

Are you sure you're not mixing up std::vector<shared_ptr<Note>> and shared_ptr<std::vector<Note>>? I'm advocating the latter. And this is basically how numpy slices / views work for example.

@Yikai-Liao
Owner Author

Yikai-Liao commented Jun 11, 2024

A vector automatically reallocates memory if it exceeds its capacity during push_back, causing all references held on the Python side to be invalidated and resulting in dangling-pointer problems. So to rely on shared_ptr alone, you must use shared_ptr<vector<shared_ptr<Note>>>. In this case, the overhead is introduced when parsing the MIDI.

To avoid this overhead, we need something like boost's stable_vector, which does not automatically reallocate memory. But stable_vector is in fact a node-based structure. I want the data structure to have better locality (cache friendly), so I wrote pyvec.

I wrote pyvec as a container precisely to achieve the kind of lazy resolution you envision. I'll draw a schematic of the pyvec data structure later for ease of understanding, but I'm not much of a user of SVG drawing tools, so I'll try to draw it by hand first.

The most important features of pyvec are:

  1. does not reallocate memory (stable)
  2. guarantees a high degree of memory contiguity (although not perfect)
  3. only adds new elements by deep copy, and finally releases all objects together
  4. can create shared_ptrs to any object in the container when needed
  5. can quickly create slices (shallow copies, just like slicing a Python list)

@Yikai-Liao
Owner Author

Yikai-Liao commented Jun 11, 2024

In numpy, any append, stack, or concatenate operation copies the whole array rather than reallocating in place. I don't think that's acceptable, because we expect users to use the list containers in symusic just like a Python list, not like a numpy array.

@wrongbad

I think you missed my point. If your python objects hold these references, then no vector allocations would ever invalidate them:

std::shared_ptr<std::vector<Note>> array;
int index;

The Note* references would be invalidated, but you would never be holding those anyway.

Your requirements also already describe std::deque<Note> which performs chunked allocations for appending to keep references stable, so it's even less clear to me why you needed to build something custom. But you pay some indexing overhead over the fully contiguous std::vector, and you still need the shared_ptr around the whole thing to prevent the root container deletion from invalidating Note proxies.

And I never suggested using the numpy append operator. I'm suggesting using std::vector for append. I'm just pointing out that it's a common pattern (in numpy and many other tensor libraries) for user friendly (pythonic) indexing and slicing to grab a shared_ptr to the root buffer and just hold indexing data, to prevent dangling references, exactly the problem you have here.

@Yikai-Liao
Owner Author

Firstly, I want to make it clear that deque is not always stable.

Inserting or removing elements in the middle invalidates all iterators and references.

See here: https://stackoverflow.com/questions/26858111/will-stddeque-preserve-pointer-validity-of-its-contained-objects

Secondly, let's think about your code.

std::shared_ptr<std::vector<Note>> cpp_array;

On the Python side, we get a corresponding Python object, py_array, holding a reference to cpp_array (or just a copy of the shared pointer, in some implementations). Indeed, as long as py_array, the Python object, is not freed, the destructor of cpp_array will never be executed.

But I need to make it clear that shared_ptr is not magic, it only affects when the destructor is executed.

# Build the py_array object and get a reference of its element
# We just create a shared_ptr<Note> from shared_ptr<vector<Note>>
# The destructor of the corresponding cpp_array is not executed until the python side releases obj0.
obj0 = py_array[0]

# Assuming that the size and capacity of py_array is 8 at the moment
# Append one object would cause reallocation
# During this process, the reference counts of all shared_ptr remain unchanged, and the destructor functions of obj0 and py_array are not executed
# However, the memory in which obj0's actual data is stored has already been automatically freed by the vector, a process that is not sensed by shared_ptr because it does not modify the reference count.
py_array.append( ... ) 

# Try access obj0
# You may find that obj0 is still accessible sometimes, because that part of the memory is not collected by the system in time, but it is already an access out of bounds.
print(obj0)

I only mentioned the reallocation problem before, not because it's the only problem, but because I think it's fatal enough on its own.
There are some other problems, for example:

# assume that size=5 and capacity=10
obj4 = py_array[4]
obj3_copied = deepcopy(py_array[3])

# Insert a new object in the front
# At this point, there is enough capacity in the vector, and the insert operation has moved all the elements back. 
py_array.insert(0, ...)

# However, from obj4's point of view, there is no sense of what has happened in the vector; the reference count of shared_ptr has not changed, the destructor has not been executed, and the location of the pointer has not changed. 
# So after the insert operation, obj4 will point to an element that was previously obj3.
assert obj4 == obj3_copied

I have actually tested all of the above analysis myself (using both pybind11 and nanobind). I'm trying to show that I did think long and hard about the solution you're describing and experimented with it. I've also been frustrated that simply adding a shared_ptr does not meet the actual requirements, as it increases my workload considerably.

shared_ptr is not magic, it's just a reference counter that uses atomic operations and does not affect the behavior of a class. So I ended up implementing my own container, pyvec, which, although it uses shared_ptr internally to manage memory, is not the same as using shared_ptr directly.

Lastly, I also appreciate you being able to offer suggestions, and I'm happy to be able to share with others the tons of difficulties I've had dealing with this issue, especially since my initial thought process was in line with yours.

If there is anything else that I have not expressed clearly or misrepresented, please feel free to continue to ask questions.

@Yikai-Liao
Owner Author

I know there is another solution: when accessing an element in a container, return only a copy, never a reference. As long as references are not allowed, there is no dangling-pointer problem. PyO3 uses exactly this strategy to preserve Rust's memory safety. I originally wanted to write symusic with PyO3, too.

However, I want to make it easier for users to modify the elements in the container, like this:

for note in notes:
    note.pitch += 12
# or
notes[-1].pitch += 12
notes[0].duration = 0

If this solution is used, the second way of modifying the elements in the container will fail even if we can use some tricks to allow reference access in the iterator.

It is for this reason that we discarded the MIDI parsing code we had already written in Rust.

@wrongbad

You're still missing my point about never holding note pointers, but rather only container pointers and indexes. This way it's literally impossible to have a dangling reference.

But your point about index shifting when inserting at the beginning or middle of an array is valid.

I think this highlights a key divergence in my use case and expectations from yours. I picture the parser (and writer) as standalone components, and editing as a separate process better served by other tools.

I ended up writing a prototype last night so I could move a different project forward. https://github.com/wrongbad/tensormidi

Refraining from custom data structures unlocks a ton of optimized manipulations doable with existing tools.

Convert delta microseconds to absolute seconds:

import numpy
import tensormidi
events = tensormidi.MergedTrack('song.mid')
abs_times = numpy.cumsum(events['dt'] / 1e6)

Filter by channels and event type:

events = tensormidi.MergedTrack('song.mid')
chan1 = events[events['channel'] == 1]
chan1notes = chan1[(chan1['type'] == NOTE_ON) | (chan1['type'] == NOTE_OFF)]

There are infinite ways to splice and mutate midi data, and it would be really hard to support all of them with custom APIs when you control the data structure and its mutations.

@Yikai-Liao
Owner Author

Yikai-Liao commented Jun 11, 2024

As you said, having each object hold the vector plus an index is also an unusable solution: sorting, inserting, and other operations invalidate the index.

Secondly, to better support other libraries such as numpy, I chose to add the .numpy() function to convert the data to an SoA structure; on the Python side it is a Dict[str, np.ndarray].
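For example (a sketch; the file path and the exact field names are illustrative):

from symusic import Score

score = Score("song.mid")            # placeholder path
soa = score.tracks[0].notes.numpy()  # Dict[str, np.ndarray], one array per field
transposed = soa["pitch"] + 12       # vectorized operation, assuming a "pitch" key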

@Yikai-Liao
Owner Author

Based on your code, what I understand is that you are trying to decode midi to the midi message level numpy array or tensor.

In symusic, I chose to decode the midi to note level, and the SoA's mentioned earlier are all note level as well.

I chose the note level because, when I was researching the field of symbolic music generation, almost no recent work still directly followed the MIDI message token sequence, except for the early Music Transformer. A great deal of work has demonstrated that placing the messages from the same note in adjacent tokens, or in the same token, facilitates model learning.

Of course, this also follows the design of miditoolkit and pretty_midi.

Note that converting MIDI messages to note-level information requires more complex control flow (if/else), and it is difficult to get high efficiency using numpy alone. This is one of the things I want to speed up in symusic.

@wrongbad

wrongbad commented Jun 11, 2024

unusable solution

It all depends what you're trying to use it for. I think it's perfectly usable as long as you don't use the library itself to edit the data.

Personally I'd prefer the tradeoff where I can parse straight into a contiguous buffer which can be mapped directly into pytorch tensors with the buffer protocol and then sent to GPU, all in sub-ms time.

placing messages from the same Note in adjacent or the same token facilitates model learning

This is getting off topic but I'm actually very curious what you mean here. I'm also curious why you think it can't be done with vectorized numpy (or torch) ops. Are you talking about tokenizing with duration, rather than "note off" events?

@Yikai-Liao
Owner Author

In fact, on the c++ side, we have already abstracted the code for the midi message layer, the minimidi mentioned in the readme. But for simplicity, there is currently no interface to bind the midi message level.

@Yikai-Liao
Owner Author

Yikai-Liao commented Jun 11, 2024

First, on the subject of saving the index, why would you need to save the index in a scenario where you don't need to edit it? The size of a note is smaller than shared_ptr + size_t. Copy is obviously better in this case (just like what pyo3 did).

Note level means that you can directly read the start and end time, velocity, pitch, and track of a note.

Message level means that you're just reading a sequence of messages directly, keeping the original sequence of events, and the note on and note off events for the same note will be spaced out by an uncertain number of events.

Second, about decoding from the MIDI message level to the note level: you need to solve a lot of note-on/note-off pairing problems, determining for each note-on the corresponding track and channel at that time. Different events such as control change, set tempo, etc., each need corresponding code to decode their specific information, not to mention that meta messages have variable length.

All of this suggests that you need to iterate through all messages sequentially, with lots of if/else blocks, which does not suit parallel processing.
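For illustration, a minimal sketch of the note-on/note-off pairing step (the message layout is simplified; this is not symusic's actual code):

def pair_notes(messages):
    # messages: (tick, kind, channel, pitch, velocity) tuples in file order
    notes, pending = [], {}
    for tick, kind, channel, pitch, velocity in messages:
        if kind == "note_on" and velocity > 0:
            pending[(channel, pitch)] = (tick, velocity)
        elif kind == "note_off" or (kind == "note_on" and velocity == 0):
            started = pending.pop((channel, pitch), None)
            if started is not None:
                start, vel = started
                notes.append((start, tick - start, pitch, vel))  # (start, duration, pitch, velocity)
    return notes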

In symusic's decoding performance tests, we also observed that the compiler's automatic SIMD optimization does not affect decoding performance (indirectly confirming that note-level decoding is difficult to parallelize).

@Yikai-Liao
Owner Author

For the development history of symbolic music representations, I recommend reading Music Transformer -> REMI -> Compound Word.

It's true that symbolic music representations are still evolving, and even with tokenizers like MidiTok that can be used directly, people are still modifying their representations, but there are still some trends that we can see in the development process.

And I wrote symusic to replace miditoolkit and pretty_midi, to speed up decoding MIDI files to note-level information. After decoding to the note level, a lot of operations (such as data augmentation) can be done directly at the numpy level as quick and easy batch operations.

@wrongbad

why would you need to save the index in a scenario where you don't need to edit it?

You could still allow in-place note edits without allowing insert/delete, as that would be less surprising for users who try to modify notes. But since writing my initial suggestion, I've come to think a numpy structured array is much better; its reference and mutation semantics are well established.

Yeah, I guess duration requires sequential processing. With a pure-numpy input, you could write a numba.jit function that would approach c++ speed. It could also just be a short c++ helper function to convert note_offs to durations.

In any case, I'm still not seeing much use for supporting sequence insertions or deletions. The numpy philosophy is for length-altering ops to just copy / fill a new destination array. Since you're generally adding or removing many notes, this is actually much faster in common cases than using a data-structure with insertion capability, and applying each insertion / deletion sequentially.

I've been working on low-latency causal modeling, where durations aren't yet known when you need model predictions. I know duration helps with stuck-notes, but I've found that including the note-state as an auxiliary input also mitigates this.

@Yikai-Liao
Owner Author

As for decoding speed, while I know symusic still has some room for optimization, it shouldn't make a dramatic difference. We did a lot of optimization guided by profiling tools, which improved performance several times over the original version; the sub-millisecond decoding times don't show up in the benchmark because mahler.mid is a very large MIDI file, basically the largest we could find.

@wrongbad

wrongbad commented Jun 11, 2024

Can you share a link where I can find mahler.mid?

I've been testing on bread-midi-dataset, measuring aggregates in files / second.

@Yikai-Liao
Owner Author

Yikai-Liao commented Jun 11, 2024

First of all, it's still about saving indexes. If you want to support inplace editing without changing the order, pybind11 and nanobind's default vector binding strategy (reference) is very efficient and you don't need to modify it yourself, and saving indexes may be even more inefficient.

Of course, if your goal is to support append, extend, inplace editing and keeping references of some element, with operations such as insert remove sort clear disabled, then preserving the index is indeed a viable option.

Secondly, I also considered using numba's solution, but there are two problems with it. One, if you want to decode to a traditional SoA data structure, you need to use numba's jitclass, which has been maintained in experimental for several years and has low performance when accessed in pure python code. Second, decoding to only one SoA data structure, as I mentioned above, Dict[str, np.ndarray], would make it very cumbersome or unintuitive to create new scores note by note. This second point is also the reason why I kept the data structure similar to miditoolkit. (For symbolic music generation, converting model-generated sequences to midi usually involves creating a score note by note.)

In symusic, I chose the traditional AoS structure as the main interaction interface and provided a fast bidirectional conversion interface to the numpy-based SoA structure. This seems like a good tradeoff to me. Of course, you can't expect one library to fulfill everyone's needs, and maybe my design doesn't fit your application scenario.
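For example, the note-by-note creation path looks roughly like this (the empty Score/Track constructors and the toy data are assumptions here, not documented API):

from symusic import Note, Score, Track

model_output = [(0, 480, 60, 90), (480, 480, 64, 90)]  # toy (start, duration, pitch, velocity) tuples
score, track = Score(), Track()                        # hypothetical empty constructors
for start, duration, pitch, velocity in model_output:
    track.notes.append(Note(start, duration, pitch, velocity))
score.tracks.append(track)
notes_soa = track.notes.numpy()                        # then convert to SoA for batch operations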

Third, here is mahler: https://github.com/lzqlzzq/minimidi/blob/main/example/mahler.mid

By the way, if you haven't implemented MIDI decoding before, I recommend taking a look at our minimidi, a single-header C++ library for parsing MIDI files to the message level (also with the corresponding encoding code). We ran into quite a few pitfalls, or special cases to deal with, in implementing the decoding part, although most of them exist in the message-level to note-level conversion process.

@wrongbad

wrongbad commented Jun 11, 2024

Actually numba understands structured arrays perfectly out of the box. Numpy's structured arrays give you the best of both worlds with SoA and AoS functionality and syntax while being densely packed in both dimensions. Here's a working example to extract note durations

import os
import numba
import numpy as np
import tensormidi

@numba.jit
def durations(events):
    n = len(events)
    out = np.zeros(n, dtype=np.uint32)
    durs = np.zeros((16, 128), dtype=np.uint32)
    for i in range(n-1, -1, -1):
        e = events[i]
        if e.type == 0x90: # NOTE_ON
            out[i] = durs[e.channel, e.key]
        elif e.type == 0x80: # NOTE_OFF
            durs[e.channel, e.key] = 0
        durs += e.dt
    return out

fname = os.path.expanduser('~/dev/data/midi/mahler.mid')
events = tensormidi.MergedTrack(fname).events

durs = durations(events)

durs = durs[events['type'] == 0x90]
events = events[events['type'] == 0x90]

print(events[:20], durs[:20])

In this case, events.dt is already in microsecond units with set_tempo events handled inside MergedTrack.

@wrongbad

For building scores note-by-note, is there any downside to using a python list, and then bulk packing with np.array() at the end? Or just pre-allocating a buffer with number of notes to generate, and set index i++ each step?

@Yikai-Liao
Owner Author

Sorry for the typo, I meant to say that numba doesn't handle Array-of-Structures (AoS) data well.

As for your question about whether we can use Python's lists to solve the problem of creating score note by note, it's really a matter of whether the library is AoS or SoA based, and of course both options are possible. symusic inherits from miditoolkit, which is AoS based, and since the data structures are all in c++, the conversion to SoA is very fast.

If you choose SoA-based and use Python's native class to implement AoS, you will face the problem of slow conversion speed.

This is where the tradeoffs need to be weighed based on the usage scenario.

@Yikai-Liao
Owner Author

Yikai-Liao commented Jun 12, 2024

[hand-drawn schematic of pyvec's memory layout]

This is a schematic of pyvec's memory layout; it essentially adds a layer of pointers. @wrongbad

Since most of the time we don't adjust the order of the notes we read, I'd say most cases have good memory contiguity.

Because ptrs is encapsulated inside pyvec and all pointers in ptrs are required to point to data in buffers, operations such as append force a copy of the object into buffers, and everything in buffers is released together at the end.

Such a data structure also allows fast slicing, i.e., copying the pointers of the corresponding range plus one shared_ptr. This matches the complexity of Python list slicing, O(K), where K is the length of the target slice.
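As a rough Python model of that layout (illustration only, not the actual C++ implementation):

class PyVecModel:
    def __init__(self, chunk_size=4):
        self.buffers = []  # chunks of real data; elements never move once stored
        self.ptrs = []     # indirection layer: (chunk, offset) pairs define the logical order
        self.chunk_size = chunk_size

    def append(self, value):
        # in the real pyvec, append deep-copies the object into a buffer chunk
        if not self.buffers or len(self.buffers[-1]) >= self.chunk_size:
            self.buffers.append([])
        self.buffers[-1].append(value)
        self.ptrs.append((len(self.buffers) - 1, len(self.buffers[-1]) - 1))

    def __getitem__(self, i):
        chunk, offset = self.ptrs[i]
        return self.buffers[chunk][offset]

    def slice(self, start, stop):
        # O(K) slicing: copy only the indirection entries, share the buffers
        out = PyVecModel(self.chunk_size)
        out.buffers = self.buffers
        out.ptrs = self.ptrs[start:stop]
        return out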
