
Force little-endian format when writing eeg datafile #80

Merged
merged 14 commits into from Sep 29, 2021

Conversation

Aniket-Pradhan
Contributor

@Aniket-Pradhan Aniket-Pradhan commented Sep 24, 2021

Also updated the docs to add information about this change 😸

closes #79

@Aniket-Pradhan
Contributor Author

Force-pushed to fix the flake8 tests. It should work fine now.

@codecov

codecov bot commented Sep 24, 2021

Codecov Report

Merging #80 (cf048b7) into main (456d014) will not change coverage.
The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main       #80   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            3         3           
  Lines          421       423    +2     
=========================================
+ Hits           421       423    +2     
Impacted Files Coverage Δ
pybv/io.py 100.00% <100.00%> (ø)


Co-authored-by: Clemens Brunner <clemens.brunner@gmail.com>
Member

@sappelhoff sappelhoff left a comment


Cool, thanks both. Then only three things remain:

  1. @Aniket-Pradhan please add an entry to the changelog here: https://github.com/bids-standard/pybv/blob/main/docs/changelog.rst, copying the style of the other entries (and add yourself to the list of authors)
  2. Please append your name and info to https://github.com/bids-standard/pybv/blob/main/CITATION.cff as well
  3. Please add a # pragma: no cover comment (or whatever it was that prevents coverage analysis from counting those lines) to the if clause, because we are (deliberately) not covering that in our tests, and I don't want coverage to drop due to deliberate decisions :-)

@sappelhoff sappelhoff added this to the 0.6.0 milestone Sep 27, 2021
Co-authored-by: Clemens Brunner <clemens.brunner@gmail.com>
Member

@sappelhoff sappelhoff left a comment


awesome, thanks! I'll wait for a note from you that everything did indeed pass ... then we can merge and I'll make a new release.

@Aniket-Pradhan
Contributor Author

Aniket-Pradhan commented Sep 27, 2021

There seems to be something wrong with the test, and I am guessing it is because we cannot pass a big-endian formatted array on a little-endian machine and expect it to carry out the scaling (and other arithmetic) operations correctly.

For example, if we reverse the bytes and change the dtype of a NumPy array, the scaling operation would give different results on little-endian and big-endian machines.

Furthermore, numpy.tofile loses the endianness information of the NumPy array and exports the data in whatever the machine's endianness is. The test was working earlier because I had accidentally passed the wrong variable, but upon reviewing the coverage report I found this mistake.

Therefore, I don't think it would be possible to test this section on a little-endian machine. OTOH, I guess we could replace if (sys.byteorder == "big" and byteorder == "=") or byteorder == ">": with if sys.byteorder == "big":, because it would not make sense to work with a big-endian dtype on a little-endian machine.

@sappelhoff
Member

I see. In that case, let's still keep the check for endianness there - even if only 1 in 10k people run into it, it's a small thing to do and makes the code also easier to understand IMHO.

Then we remove the test and add the pragma: no cover back, to get 100% diff coverage.

@cbrnr
Collaborator

cbrnr commented Sep 27, 2021

There seems to be something wrong with the test, and I am guessing it is because we cannot pass a big-endian formatted array in a little-endian machine and expect it to carry out the scaling (and other arithmetic) operations correctly.

Wait, that's exactly what should happen. If you create a big-endian array, NumPy should know how to handle that data and perform operations correctly.

E.g. these two arrays are identical after multiplication:

import numpy as np

little = np.arange(1e6, dtype="<f8").reshape(1000, 1000)
big = np.arange(1e6, dtype=">f8").reshape(1000, 1000)

little *= 7
big *= 7

np.array_equal(little, big)

If this example does not demonstrate the issue, I'd be curious to see something that doesn't work.

@cbrnr
Collaborator

cbrnr commented Sep 27, 2021

When writing arrays to disk with .tofile(), the files are not identical, which means that they should differ in their endianness:

little.tofile("little.dat")
big.tofile("big.dat")
!md5sum big.dat little.dat 
5157b19a956273b22d8cd056b5eea320  big.dat
d2ae4a99442623c5ef2678a0ea5a3429  little.dat
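
The same point can be seen without touching the filesystem: .tofile() dumps the array's raw memory buffer, so two arrays with identical values but different byte orders carry different raw bytes (a minimal sketch of the md5sum comparison above):

```python
import numpy as np

# .tofile() writes the raw memory buffer, so the in-memory byte order
# is exactly what lands on disk.
little = np.array([1.0, 2.0], dtype="<f8")  # explicitly little-endian
big = np.array([1.0, 2.0], dtype=">f8")     # explicitly big-endian

print(np.array_equal(little, big))        # True: same logical values
print(little.tobytes() == big.tobytes())  # False: different raw bytes
```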

@Aniket-Pradhan
Contributor Author

Aniket-Pradhan commented Sep 27, 2021

You're right, NumPy is handling the operations correctly. I was also swapping the bytes along with changing the dtype, which now seems like the wrong way to do it 😭.

However, the byte order of the NumPy array is reset to the native machine's endianness "=" whenever it is reassigned (ref. lines 476, 480, and 493, where we set it manually from the supported formats).

import numpy as np

def scale_arr(arr, scale=10):
    return arr * scale  # returns a new array (a copy), not a view

big = np.arange(1e6, dtype=">f8").reshape(1000, 1000)
big_scaled = scale_arr(big)

print(big.dtype, big.dtype.byteorder)                # >f8 >
print(big_scaled.dtype, big_scaled.dtype.byteorder)  # float64 =  (on a little-endian machine)

From what I can understand, I guess NumPy only retains the dtype within the same scope as the function call.

Since the byte order is reset, the array is no longer big-endian, so control never enters the if branch, which means we cannot test that block, thereby reducing the coverage. This is why I had to revert to the old code: the coverage still dropped when I committed the test case the first time.

@sappelhoff
Member

@cbrnr merge when you are happy.

@cbrnr
Collaborator

cbrnr commented Sep 28, 2021

Indeed, it seems like byte order is not preserved when copies are involved: e.g., big = big * 7 returns a new array with native endianness, whereas in-place operations such as big *= 7 preserve endianness. Therefore, I would try to replace operations involving copies with in-place variants where possible. However, this would probably require a ton of modifications that I'm not sure we want to maintain. So I guess I'll probably merge, but I would like to confirm with @sappelhoff.
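
The copy-vs-in-place behavior described above can be checked on any host by picking whichever byte order is non-native on the current machine (a small sketch, not pybv code):

```python
import sys
import numpy as np

# Pick the byte order that is NOT native on this machine.
nonnative = ">f8" if sys.byteorder == "little" else "<f8"
arr = np.arange(10, dtype=nonnative)

copied = arr * 7  # ufunc output: a new array, allocated in native order
arr *= 7          # in-place: reuses the buffer, byte order preserved

print(copied.dtype.byteorder)  # "=" (native)
print(arr.dtype.byteorder)     # still the non-native byte order
```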

@sappelhoff
Member

However, this will probably require a ton of modifications that I'm not sure we want to maintain. So I guess I'll probably merge, but I would like to confirm with @sappelhoff.

Yes, thanks for confirming with me - I also think we should merge. We clarified a major point in this issue/PR with Brain Products and made sure that pybv correctly writes as little-endian from now on. The only issue that remains would be a person on a little-endian machine creating a big-endian array, and then passing it to write_brainvision ... and I think it's safe to say that'll never happen (hehe).

@cbrnr
Collaborator

cbrnr commented Sep 28, 2021

... and I think it's safe to say that'll never happen (hehe).

Famous last words. The only thing that is bugging me is that we don't have a test.

@sappelhoff
Member

Famous last words. The only thing that is bugging me is that we don't have a test.

yes, fair enough (that's also why I typed hehe). Could we add a check at the top of write_brainvision that errors when an array is not in native byte order? That way we would cover all cases, right? The only untested thing (pragma: no cover) would then be the if clause that converts a native byte order array to little-endian on a big-endian machine --> and that's something we could test by adding a new CI run on a big-endian architecture. (We discussed that previously but decided against it; we can reconsider.)

@cbrnr
Collaborator

cbrnr commented Sep 28, 2021

Could we add a check to the top of write_brainvision that errors when an array is not in native byte order? That way we would cover all cases, right?

Ideally, we should be able to write a file (in LE) even with a BE array on both architectures. In other words, we should always write an LE data file no matter which array we throw at it, independent of the host architecture.
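
One way to get that behavior would be to convert to an explicitly little-endian dtype just before writing. This is a hypothetical sketch, not pybv's actual implementation; the helper name and the float32 format are assumptions for illustration:

```python
import numpy as np

def write_little_endian(data, fname):
    """Hypothetical helper: write `data` as little-endian float32,
    regardless of the input byte order or the host architecture."""
    # astype with an explicit "<" dtype yields a little-endian buffer;
    # copy=False skips the copy when the data already has that layout.
    data.astype(np.dtype("<f4"), copy=False).tofile(fname)
```

With something like this, a >f8 array and a <f8 array holding the same values produce byte-identical files on any host.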

The only untested thing (pragma: no cover) would then be the if-clause to convert a native byte order array on a big-endian machine to little-endian --> and that's something we could test by adding a new CI on a big-endian architecture. (we discussed that previously but decided against it, but we can reconsider)

I'm still -1 on adding a new CI, but the if clause is also true if the array is BE (independent of the architecture), so we should be able to test it.

@sappelhoff
Member

Okay, but that brings us back to your comment from #80 (comment)

Therefore, I would try to change operations involving copies with in-place variants if possible. However, this will probably require a ton of modifications that I'm not sure we want to maintain.

So we would need to make sure an array keeps its byte order as it passes through write_brainvision, see #80 (comment)

If that's possible, then yes, we can test everything and ensure that passing any byte ordered array on any architecture results in (correctly scaled/processed) little endian format.

@Aniket-Pradhan
Contributor Author

So we would need to make sure an array keeps its byte order as it passes through write_brainvision, see #80 (comment)

One shabby solution I was thinking of is that we could reset the dtype of data every time it is assigned a new copy. That way, we retain the appropriate dtype until we reach the final if condition. However, that would be a lot of overhead for a very far-off case. Furthermore, we would have to use the appropriate dtype from the supported formats.

I am sure we can arrive at a much better solution than this, but I cannot think of anything better right now.

@sappelhoff
Member

One shabby solution I was thinking, is that we could reset the dtype of data every time it is assigned a new copy.

It's pragmatic, I like that. And maybe some of the times we can just avoid making a copy and thus retain the dtype without the "shabby solution" :-)

Furthermore, we would have to use the appropriate dtype from the supported formats.

you mentioned that before, could you please elaborate why that would be a problem?

@cbrnr
Collaborator

cbrnr commented Sep 29, 2021

OK, so I looked at the changes again and it seems like we can simplify the condition to:

if sys.byteorder == "big" and byteorder == "=":  # pragma: no cover
    data = data.byteswap()

The case byteorder == ">" never occurs because we're doing some copying before, which changes the array to native order.

Therefore, unless we change a lot of code just for the sake of testing on an LE architecture, we really have to mark this line as untestable. However, I think that's fine – sorry if this was already clear to you previously; I thought we could somehow create a test that works on LE without changing too much.

We can always add this later if we think it is necessary.
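
The byteswap() semantics behind that untestable line can at least be sanity-checked on any host: swapping a native-order array yields the same raw bytes as building the array in the opposite byte order directly (a small sketch):

```python
import sys
import numpy as np

native = np.array([1.0, 2.0], dtype="=f8")
opposite = ">f8" if sys.byteorder == "little" else "<f8"

# byteswap() reverses the bytes of every element but keeps the dtype
# label, so the swapped buffer matches an opposite-endian array.
swapped = native.byteswap()
print(swapped.tobytes() == np.array([1.0, 2.0], dtype=opposite).tobytes())  # True
```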

@cbrnr cbrnr merged commit 0088662 into bids-standard:main Sep 29, 2021
@cbrnr
Collaborator

cbrnr commented Sep 29, 2021

Perfect! Thanks @Aniket-Pradhan and @sappelhoff!

@sappelhoff
Member

@Aniket-Pradhan please run your tests and make sure everything passes :-) ... then notify me and I'll release pybv 0.6.0

@Aniket-Pradhan
Contributor Author

ohh, looks like I am late to the party 😛

It's pragmatic, I like that. And maybe some of the times we can just avoid making a copy and thus retain the dtype without the "shabby solution" :-)

As @cbrnr pointed out, data would automatically be converted to the native architecture before it reaches the if condition. Therefore, unless we want to modify a chunk of code just to test a possibility (which should work anyway, because the data is automatically converted to the native byte order beforehand), it doesn't seem feasible to me.

That's why I was calling it a "shabby" solution, because it is not really needed, lol. Anyway, thanks for being supportive. Let me know if you plan to go ahead with a similar solution, and I would be happy to help 😄

Furthermore, we would have to use the appropriate dtype from the supported formats.

you mentioned that before, could you please elaborate why that would be a problem?

I basically meant that we would have to specify the endianness in the supported formats if we were to proceed with testing the code. 😛

@Aniket-Pradhan please run your tests and make sure everything passes :-) ... then notify me and I'll release pybv 0.6.0

I'll run the tests on Travis, and hopefully everything will work fine. I'll ping here as soon as the build finishes.

Thanks a lot @cbrnr and @sappelhoff for your help. You all rock!

@Aniket-Pradhan
Contributor Author

@sappelhoff everything looks green. We are good to go! :D


@sappelhoff
Member

https://pypi.org/project/pybv/0.6.0/ 🎉 feel free to draft a tweet @Aniket-Pradhan or @cbrnr or @hoechenberger - I'll like and retweet of course 😉 (but no energy to draft one myself)

Development

Successfully merging this pull request may close these issues.

Tests fail with big-endian machines
3 participants